INTRODUCTION

The  work  described  in  this  report,  which  includes  both  basic 
research  on  automatic  indexing  and  the  design  of  an  operational  system, 
was  performed  at  Documentation  Incorporated  under  the  contract  with  the 
Air  Force  Office  of  Aerospace  Research,  No.  AF  49(604) -4236.  Phase  one 
of  the  project  implementation  was  the  preparation  of  a  state-of-the-art 
survey  and  a  bibliography,  which  are  published  as  Part  I  of  the  report. 

It  includes  a  thorough  evaluation  of  all  the  reported  experiences  and 
results  in  the  automatic  indexing  field.  Phase  two  of  the  project  was 
a  detailed  analysis  of  the  particular  characteristics  of  the  input 
material  for  which  the  automatic  indexing  system  was  to  be  designed. 
Mathematical  models  for  certain  index  formation  processes  were  derived. 
The  results  of  the  findings  are  described  in  Part  II  of  the  report, 
which  also  contains  the  description  of  the  proposed  Formal  Auto-indexing 
of  Scientific  Texts  (FAST)  System.  On  June  30,  1965,  the  Air  Force 
Office of Scientific Research invited a selected audience of representatives
of government agencies and non-government organizations with vested
interests in the information processing field to a demonstration of this new
FAST system at the Documentation Incorporated premises in Bethesda, Md.
The opening remarks of Col. Donald R. Currier of the AFOSR and of
Dr. Mortimer Taube of Documentation Incorporated follow this Introduction.

STATEMENT OF COL. DONALD R. CURRIER


This subject indexing work which has been done under the ILSE
contract  is  an  example  of  what  happens  when  the  time  has  arrived  for  a 
good idea to come to reality. It is not because of some miraculous technical
breakthrough that we have a demonstrable system today, although
Mr.  Zunde  here  at  Documentation  Incorporated  has  pushed  the  state-of-the- 
art  forward  a  significant  notch.  It  is  because  the  one  missing  ingredient 
in  most  previous  experiments  with  computer  indexing  was  present  this  time. 
This  ingredient  was  a  very  large  store  of  abstracts  that  not  only  had  to 
be  put  in  machine  useable  form,  but  also  had  to  be  hand  subject  indexed 
to  meet  a  basic  ILSE  requirement  for  a  controlled  vocabulary  for  subject 
searches.  All  of  the  costs  to  do  the  above  tasks  could  be  considered 
as  "sunk  costs"  from  the  standpoint  of  the  automatic  indexing  task.  They 
would  be  incurred  anyway  even  if  no  automatic  indexing  research  were  to 
be done. Thus, a tested working medium for the next step was all paid for.

Some  extra  money  came  about  because  the  original  estimates 
someone  made  of  what  the  DOD  portion  of  the  ILSE  effort  would  cost  were 
high.  The  money  fell  into  my  hands  just  at  the  time  when  I  had  become 
interested in adding this sort of capability to MCDS and had been discussing
the matter with the people at Documentation Incorporated. It
was not difficult to sell the idea to Dr. Frese, the ILSE Panel Chairman,
that we might save a lot of future ILSE money by risking some of the
current year's surplus, nor to convince him that a modest extension to
test the general applicability using OAR data in other areas of science
was worthwhile from the DOD standpoint.

This  work  has  the  potential  to  save  the  government  a  great 
deal  of  money  and  people's  time  if  it  is  applied.  More  importantly,  it 
may  be  the  key  to  the  precisely  directed  exchange  of  one  type  of  scientific 
information  on  a  scale  that  has  not  been  possible  before  anywhere. 

I  would  now  like  Dr.  Taube,  Chairman  of  the  Board  of  Documentation 
Incorporated and a man with considerable experience in information retrieval,
to set the stage for the presentation by Mr. Zunde.

STATEMENT OF DR. MORTIMER TAUBE


As  Colonel  Currier  has  pointed  out,  we  were  able  to  begin  this 
automatic  indexing  project  without  the  necessity  of  investing  in  the  input 
costs  for  a  data  base.  This  permitted  us  to  concentrate  on  the  logic  of 
the  indexing  problem.  Following  our  usual  procedure,  in  order  to  avoid 
re-inventing  the  wheel,  we  did  a  complete  study  of  the  existing  literature 
on  automatic  indexing.  Out  of  this  study  there  emerged  the  conviction 
that  many  organizations  who  have  preceded  us  in  this  area,  have  restricted 
themselves  to  speculating  on  the  number  of  different  ways  to  do  the  job, 
rather  than  on  the  basic  question  of  determining  whether  or  not  automatic 
indexing  was  indeed  feasible  and  could  be  accomplished  with  existing 
equipment  and  program  capability. 

We discovered in this area, as in many others, a tendency on
the  part  of  those  who  speculate  and  are  not  concerned  with  the  solution 
of  operating  problems  to  complicate  the  problems  more  than  is  necessary. 
One  can  devise  many  methods  for  selecting  a  proper  set  of  index  terms 
from  a  machine-readable  text.  The  problem  is  to  determine  the  simplest 
and  most  economical  method  which  will  create  a  usable  index  of  high 
quality.  In  this  field,  as  in  many  others,  it  is  a  conviction  of 
Documentation  Incorporated  that  the  simplest  system  which  works  is  the 
best system for any particular application.

Documentation Incorporated is internationally known for the
development  of  coordinate  indexing  which  is  now  standard  operating 
procedure  with  all  organizations  using  manual  indexing  with  computer 
manipulation  of  the  index.  Coordinate  Indexing  is  based  on  the  bet 
that  indexing  can  be  accomplished  with  a  set  of  terms  with  relations 
among the terms limited to Boolean functions of "and," "or," and "not."
Many  people  have  proposed  adding  much  more  complex  relational  systems, 
but  in  no  case  has  it  been  proved  that  such  complexity  does  more  than 
raise  the  cost  without  improving  the  system.  We  are  aware  that  in  a 
Boolean  system,  we  may  not  be  able  to  distinguish  between  Venetian 
blinds  and  blind  Venetians.  But  we  will  only  worry  about  this  problem 
if  we  are  certain  that  in  our  system  of  information  we  have  stored  an 
equal  amount  of  data  on  both  blind  Venetians  and  Venetian  blinds.  If 
we  have  only  information  on  building  materials,  namely  Venetian  blinds, 
we  will  not  worry  about  the  possibility  of  retrieving  information  on 
blind  Venetians  if  there  is  no  information  on  blind  Venetians  in  the 
system. 
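
The Venetian blinds point can be illustrated with a minimal Python sketch, assuming a hypothetical two-document collection in which each document is reduced to an unordered set of index terms. The names `index`, `normalize`, and `search_and` are illustrative only and do not describe any actual Documentation Incorporated program; only the Boolean "and" is shown.

```python
# Hypothetical index: each document is represented only by its set of terms.
index = {
    "doc1": {"venetian", "blinds", "building", "materials"},
    "doc2": {"blind", "venetians", "history"},
}

def normalize(term):
    # Crude illustrative normalization: lowercase and strip one trailing "s".
    term = term.lower()
    return term[:-1] if term.endswith("s") else term

def search_and(index, *terms):
    """Return documents whose term sets contain every query term (Boolean AND)."""
    wanted = {normalize(t) for t in terms}
    return sorted(
        doc for doc, doc_terms in index.items()
        if wanted <= {normalize(t) for t in doc_terms}
    )

both = search_and(index, "Venetian", "blinds")
only_materials = search_and({"doc1": index["doc1"]}, "Venetian", "blinds")
```

With both documents stored, the query cannot tell them apart because term order is not recorded. If the collection holds only building materials, as in `only_materials`, the ambiguity never arises.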


Now  it  turns  out  to  be  the  case  that  many  people  who  have 
developed  elaborate  syntactical  and  semantic  rules  for  automatic  indexing, 
have  done  so  without  regard  to  the  actual  amount  of  noise  or  erroneous 
information  which  might  be  retrieved  with  simpler  and  less  costly  systems; 
therefore  we  have  followed  information  theory  and  have  tried  to  create 
the freest and simplest system consistent with the creation of an index
adequate for the uses to which it will be put. Mr. Zunde will tell
you about the details of this system. We are not claiming a breakthrough
or any great discovery in this regard, but merely another
demonstration  that  rigorous,  logical  analysis  and  attention  to  the 
requirements  of  theory  and  economic  feasibility  can  deliver  important 
and usable operating answers.

PART I

STATE-OF-THE-ART  OF  MACHINE  INDEXING 


1.1.  SCOPE  OF  THE  STATE-OF-THE-ART  STUDY 


The  advent  of  computers  has  opened  new  vistas  in  the  information 
processing  field.  Among  the  many  areas  which  have  already  received  some 
consideration  has  been  the  mechanization  of  indexing.  Most  likely  it 
will  result  in  the  elimination  of  much  human  effort  from  the  indexing 
process,  with  the  reduction  of  human  bias  or  distortion  from  the  process 
as  a  secondary  effect. 

This  state-of-the-art  study  briefly  surveys  recent  developments 
in  the  machine  or  automatic  indexing  field.  At  the  present  time,  automatic 
indexing  is  basically  in  an  experimental  stage.  Various  methods  of  auto¬ 
matic  indexing  are  described  and  evaluated.  Areas  of  research  required 
to  improve  operational  qualities  of  proposed  systems  are  indicated.  It 
is  hoped  that  this  study  will  help  systematize  the  thoughts  of  persons 
interested  in  automatic  indexing  and  that  it  will  suggest  various  possible 
approaches  to  solutions  of  their  particular  problem. 

Emphasis  has  been  placed  on  quantitative  rather  than  qualitative 
methods  of  automatic  indexing.  At  this  stage  of  development  quantitative 
methods  offer  much  greater  possibilities  for  practical  application  because 
they  are  less  complicated  and  therefore  less  expensive  and  time  consuming. 
Qualitative  methods,  such  as  the  methods  of  linguistic  analysis  which  form 
the basis of machine translation, were only remotely considered by a few
researchers for application in machine indexing, and very little has been
done to test these ideas in practical experiments.

The study also does not cover research aimed at full text searches
of  documents,  even  though  there  are  some  problems  common  to  index  generation. 
It  was  not  considered  the  purpose  of  this  study  to  investigate  that  which 
makes  indexing  necessary  or  superfluous,  but  how  to  produce  an  index  by 
machine. 

The  assumption  is  made  throughout  that  material  to  be  processed 
is  in  machine  readable  form.  In  other  words,  the  study  neither  concerns 
itself  with  the  conversion  to  machine  readable  form  nor  with  the  equipment 
required  to  perform  the  conversion.  It  is  realized  that  at  this  time 
conversion  to  machine  readable  form  solely  for  the  purpose  of  machine 
indexing would not be economical. In the near future, however, print-reading
devices such as computers with optical scanners should be
sufficiently  developed  to  make  this  task  economically  feasible  and 
desirable.  For  material  not  yet  printed,  type-punching  devices  attached 
to  typewriters  and  type  setting  machines  could  readily  produce  machine 
readable  records  as  by-products.  It  is  therefore  anticipated  that  the 
time  is  not  too  far  off  when  recording  information  directly  in  machine 
readable  form  will  be  a  common  thing.  This  could  then  open  the  doors  for 
a  wide-scale  application  of  machine  indexing  -  and  machine  abstracting  - 
systems. 


Note:  As  this  state-of-the-art  study  was  being  completed,  the  author 
received  a  published  copy  of  a  similar  study  by  Marie  E.  Stevens 
(5^).  Fortunately,  there  seems  to  be  no  real  duplication  of 
effort.  Whereas  the  work  of  M.  E.  Stevens  covers  a  wider  area 
of  the  utilization  of  machines  in  indexing,  this  paper  is  more 
task oriented towards system design.

1.2. MACHINE INDEXING METHODS

For  the  purpose  of  investigating  automatic  indexing,  it  is 
convenient  to  differentiate  between  indexing  by  extraction  and  indexing 
by  assignment.  In  the  first  case,  viz.  indexing  by  extraction,  selected 
words  which  appear  in  the  documents  are  used  as  indexing  terms.  The 
design  objective  is  to  make  the  machine  select  words  which  adequately 
represent  the  contents  of  the  document  and  to  record  them.  In  the  case 
of  indexing  by  assignment,  decision  is  first  made  by  the  programmed  machine 
as  to  which  particular  category  or  class  of  human  knowledge  the  document 
to  be  indexed  belongs  and  then  words,  which  are  considered  to  be  most 
pertinent  descriptors  of  that  particular  class  or  category,  are  assigned 
as  indexing  terms.  These  words  may  or  may  not  appear  in  the  document  it¬ 
self.  Thus,  if  the  document  is  on:  INVESTIGATION  OF  TURBULENCE  EFFECTS 
IN  IONIZED  PLASMA  FLOW,  the  derived  indexing  terms  might  be  TURBULENCE, 
IONIZED,  PLASMA,  FLOW.  The  assigned  indexing  terms  might  be,  for  instance, 
MAGNETOHYDRODYNAMICS  and  PHYSICS.  Obviously,  the  second  method  can  also 
be referred to as automatic categorization.
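
The difference between the two methods can be sketched in Python on the title used above. The stop list and the category table below are assumptions chosen only to reproduce the example. They are far smaller than any practical table, and the function names are hypothetical.

```python
# Hypothetical tables, sized only to reproduce the example in the text.
STOP_WORDS = {"INVESTIGATION", "OF", "EFFECTS", "IN"}
CATEGORY_TERMS = {
    # cue words that identify a category -> descriptors assigned to it
    frozenset({"IONIZED", "PLASMA", "FLOW"}): ["MAGNETOHYDRODYNAMICS", "PHYSICS"],
}

def index_by_extraction(title):
    """Keep the words of the title that are not on the stop list."""
    return [w for w in title.upper().split() if w not in STOP_WORDS]

def index_by_assignment(title):
    """Assign the descriptors of every category whose cue words all occur."""
    words = set(title.upper().split())
    assigned = []
    for cues, descriptors in CATEGORY_TERMS.items():
        if cues <= words:
            assigned.extend(descriptors)
    return assigned

title = "INVESTIGATION OF TURBULENCE EFFECTS IN IONIZED PLASMA FLOW"
derived = index_by_extraction(title)
assigned = index_by_assignment(title)
```

Note that the assigned terms need not occur in the title at all, whereas the derived terms always do.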

1.2.1.  Indexing  by  Extraction 

One  of  the  crucial  problems  in  selecting  and  extracting  indexing 
terms  from  the  text  of  the  document  is  to  find  the  significant  ones,  viz. 
such terms which would most adequately represent the contents of the document
for their later identification in a retrieval process. There are
several criteria which can be applied, and which have been more or less
successfully  applied,  in  selecting  significant  words  from  the  text. 

These  criteria  may  be  classified  into  four  main  categories: 

positional  and  typographical  criteria 
semantic  and  syntactic  criteria 
pragmatic  criteria 
statistical  criteria 

Positional  and  Typographical  Criteria.  Significance  is  often 
attributed  to  words  in  titles  of  the  documents  or  in  section  headings. 

On a sample of 25 articles, included both in Physics Abstracts and Chemical
Abstracts, Maizell (183) showed that the titles alone contained about 50-70
percent  of  the  key  terms  under  which  the  articles  were  actually  indexed.  A 
study  by  Montgomery  and  Swanson  (195)  of  the  Index  Medicus  led  them  to  the 
conclusion  that  titles  alone  provide  about  50  percent  of  clues  for  judging 
the  relevance  of  a  given  article  to  a  given  information  need. 

A well known operational system based on this concept is the
KWIC  (Key  Words  In  Context)  index  of  significant  words  in  titles,  which 
is  being  used  for  Biological  Abstracts  and  a  number  of  other  indexes. 
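
As a rough illustration of the KWIC idea, the following Python sketch produces one index line per significant title word, with the keyword aligned in a fixed column. The stop list, the column width, and the function name `kwic` are assumptions and do not reproduce the layout of any published KWIC index.

```python
# Assumed list of non-significant title words.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def kwic(titles, width=30):
    """Return KWIC lines, one per significant title word, sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])    # context preceding the keyword
            right = " ".join(words[i:])   # keyword and following context
            line = f"{left[-width:]:>{width}}  {right[:width]}"
            entries.append((word.lower(), line))
    return [line for _, line in sorted(entries)]

lines = kwic(["Automatic Indexing of Scientific Texts"])
```

Each title thus appears once under every significant word it contains, which is what lets a reader scan the keyword column alphabetically.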

Baxendale  (9)  proposed  to  partition  the  title  into  phrases  of 
three  types:  prepositional  phrases,  phrases  containing  a  conjunction,  and 
clauses. The identification of specific structural features within a
title is aided by a dictionary of approximately 300 entries consisting
of  the  letters  of  the  alphabet,  certain  punctuation  symbols,  and  certain 
words  representing  relatively  stable  syntactic  features  such  as  auxiliary 
verbs  or  irregular  adverbs.  The  eligible  index  terms,  one-,  two-  or 
three-words  long,  are  recognized  and  selected  by  the  computer  from  the 
partitioned  title  units,  and  their  grammatical  function,  such  as  adjective 
or  noun,  is  then  assigned.  Thus,  the  selection  and  assignment  rules  are 
based on the position of the words rather than on the recognition of their
grammatical  function.  The  computer  program  for  this  system,  written  in  the 
COMIT  language,  is  called  "Title  Analyzer." 

There  are  other  positional  criteria  besides  titles.  According  to 
Baxendale  (7),  references  on  composition  techniques  state  that  the 
"strategic"  location  for  the  prime  thought  of  a  paragraph  is  either  first 
or  last.  In  other  words,  these  are  the  positions  for  the  greatest  emphasis. 
An  investigation  of  a  sample  of  200  paragraphs  corroborated  the  rule: 
in  85  percent  of  the  paragraphs  the  topic  sentence  was  the  initial  sentence 
and in 7 percent the final. Operating on these sentences only would not only
greatly reduce the volume of the article, but would also have the
added  advantage  of  eliminating  much  of  the  less  significant  vocabulary  as 
well  as  many  of  the  least  pertinent  parts  of  speech,  such  as  verbs  and 
adverbs.  Baxendale  reported  in  her  experiments  the  percentage  of  conden¬ 
sation  achieved  by  selection  of  topic  sentences  and  deletion  of  common 
words ranging from 6.3 to 18.9 percent or an average of 11.6 percent.

Two quasi-automatic methods of indexing proper nouns (quasi-automatic
because they involve a considerable amount of human postediting)
were described by Artandi (4). Both methods are based on the criterion
that proper names appear capitalized in natural text.

Semantic and Syntactic Criteria. Significance is attributed to
words in virtue of their relation to certain other words, also called cue
words, such as "summary," "conclusion," etc. A technique utilizing this
method is described in the Ramo-Wooldridge report on automatic indexing and
abstracting  (48).  A  cue  word  glossary  is  compiled  for  the  population  of 
documents.  By  hypothesis  the  cue  words  tend  to  indicate  (or  appear  in 
proximity  to)  important  or  significant  material.  Using  the  cue  words, 
an  initial  set  of  sentences  is  selected,  which  is  then  examined  by  the 
program  to  identify  those  words  which  are  the  most  likely  to  be  the  key 
words  of  the  document.  The  cue  word  list  also  contains  common  words,  which 
carry  little  or  no  information,  but  these  common  words  are  assigned  a 
weight  of  zero  and  are  thereby  eliminated  from  the  document.  Thus,  every 
word  of  text  is  classified  as  either  cue  word,  insignificant  word,  or 
potential  key  word.  The  immediate  application  of  key  words  is  using  them 
as  indexing  terms. 
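
The three-way classification can be sketched as follows. This is a minimal Python illustration and not the Ramo-Wooldridge program: the two cue words come from the examples above, the common word list is an assumption, and "proximity" is simplified to "same sentence."

```python
import re

CUE_WORDS = {"summary", "conclusion"}                       # examples from the text
COMMON_WORDS = {"a", "in", "is", "of", "that", "the", "we"}  # assumed, weight zero

def classify(word):
    """Classify a word as cue word, insignificant word, or potential key word."""
    if word in CUE_WORDS:
        return "cue"
    if word in COMMON_WORDS:
        return "insignificant"
    return "potential key"

def candidate_keywords(text):
    """Select sentences containing a cue word and return their potential key words."""
    keywords = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        if not any(w in CUE_WORDS for w in words):
            continue  # sentence is not marked as significant by any cue word
        keywords.extend(w for w in words if classify(w) == "potential key")
    return keywords

sample = "The apparatus is described. In conclusion, the plasma flow is turbulent."
found = candidate_keywords(sample)
```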


O'Connor  (45)  studied  the  cue-  and  key-word  method  by  searching 
for computer rules which would duplicate indexing done by subject specialists
for a pharmaceuticals retrieval system. To begin with, he investigated just
a single term, toxicity. One hundred documents, containing thirteen
toxicity papers, formed the first random sample from the total population
of  some  ten  thousand  documents  in  the  Merck  Sharp  and  Dohme  Research  Center 
Library.  Computer-generated  word  frequency  lists  were  prepared  for  each 
sample  document.  A  thesaurus  group  of  likely  toxicity  keywords  was  derived 
from  the  retrieval  system's  indexing  guide,  a  medical  dictionary,  and  the 
papers  in  the  sample.  Thirty  sample  papers  each  contained  at  least  one 
keyword;  eight  of  these  were  papers  on  toxicity.  Five  other  papers  on 
toxicity  appeared  to  contain  no  keywords.  Frequencies  and  positions  of 
keywords  in  documents,  and  differing  weights  for  keywords,  were  used  in 
the  attempt  to  reduce  keyword  overassigning  of  toxicity.  Frequencies  did 
not  appear  to  help.  To  some  extent,  weighting  helped  but  the  best  criterion 
seemed  to  be  occurrence  in  summaries. 

A further investigation of this approach led to many other
expressions  for  toxicity  cues,  but  they  could  not  be  used  directly  for 
mechanized  indexing  because  they  were  unlikely  to  recur  in  other  papers 
on  toxicity.  Study  of  these  expressions  suggested  their  generalization 
to  "expression  forms"  containing  variables.  The  possible  values  of  the 
variables  were  defined  for  computer  use  by  lists  of  "substance-contact 
words"  and  "disorder  words."  "Expression  forms"  permitted  assigning  the 
indexing  term  toxicity  mechanically  to  four  relevant  papers  which  contained 
no  original  keywords.  Various  elaborate  indexing  rules  using  "expression 
forms" were suggested. The best of these, combined with rules involving
keywords, selected all twenty-one papers on toxicity as well as nine
irrelevant  papers  from  the  given  sample. 

We  might  also  include  in  this  category  Baxendale's  (7)  suggestion 
to  select  prepositional  phrases  as  containing  the  significant  words  of  the 
text.  According  to  Baxendale,  a  phrase  is  likely  to  reflect  the  content 
of  an  article  more  closely  than  any  other  simple  construction.  Therefore, 
she  proposed  to  make  the  preposition  itself  the  indicator  for  initiating 
selection  of  index  units.  The  length  of  a  phrase  varies  from  two  to  seven 
words,  with  an  average  of  four  words  (based  on  a  count  of  words  per  phrase 
in  350  phrases).  Thus,  "by  running  the  risk  of  selecting  too  large  or  too 
small  a  unit,  but  obviating  the  necessity  of  discriminating  to  select  nouns 
and  their  modifiers,  it  is  possible  to  program  a  computer  to  recognize  the 
preposition  by  table  look-up  and  then  automatically  select  the  next  four 
words  unless  a  second  preposition  or  a  punctuation  mark  is  encountered." 

For example, the machine would select the underlined words or word groups
in the following sentence: Within the scope of natural English language,
an infinite number of different sentence structures is possible. The
percentage of condensation achieved by selection of prepositional phrases
and deletion of common words was reported to range from 4.8 to 18.2, the
average being 11.3 percent.
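
The selection rule quoted above (find a preposition by table look-up, then take up to four following words unless a second preposition or a punctuation mark intervenes) can be sketched in Python. The preposition table is an assumption, and whether the preposition itself belongs to the selected unit is not stated in the text. This sketch leaves it out and does not delete common words afterwards.

```python
import re

# Assumed look-up table; a practical table would be longer.
PREPOSITIONS = {"by", "for", "from", "in", "of", "on", "to", "with", "within"}

def select_phrases(sentence, max_words=4):
    """Return the word groups that follow each preposition in the sentence."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", sentence)
    phrases = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() not in PREPOSITIONS:
            i += 1
            continue
        phrase = []
        j = i + 1
        while j < len(tokens) and len(phrase) < max_words:
            token = tokens[j]
            # Stop at punctuation or at a second preposition.
            if not token.isalpha() or token.lower() in PREPOSITIONS:
                break
            phrase.append(token)
            j += 1
        if phrase:
            phrases.append(" ".join(phrase))
        i = j  # a stopping preposition starts the next phrase
    return phrases

sentence = ("Within the scope of natural English language, an infinite "
            "number of different sentence structures is possible.")
units = select_phrases(sentence)
```

The last unit selected here, "different sentence structures is", shows the risk of selecting too large a unit that the quotation mentions.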

Pragmatic  Criteria.  This  approach  is  based  on  the  assumption, 
as proffered by Artandi (4, 5, 6) and Kraft (28), that it is possible to
create a vocabulary or a list of terms and a syndetic apparatus for a given
subject  area  which,  if  sufficiently  representative  of  the  field,  may  be 
used  in  the  construction  of  indexes  to  materials  in  the  same  subject  area 
by  matching  the  thesaurus  against  the  text  of  documents.  A  system  based 
on this criterion alone was described by Artandi (6). The vocabulary of
the  proposed  system  includes  the  following  elements:  (1)  terms  in  the 
detection  part  of  the  vocabulary,  each  of  which  may  consist  of  one  or 
several  words,  entirely  identical  with  the  phraseology  of  the  text;  (2) 
terms  in  the  expression  part  of  the  vocabulary,  which  are  the  terms  of 
the  final  index  and  may  or  may  not  be  identical  with  the  corresponding 
detection  term.  A  section  of  a  chemistry  textbook  was  selected  as  the 
experimental  document  and  it  was  reported  that  the  vocabulary  of  the  system 
contained 744 detection terms. Unfortunately, the report does not contain
information  on  the  size  of  the  document  or  the  total  number  of  words  it 
contained,  neither  does  it  give  a  description  of  how  the  detection  terms 
were  derived  or  selected. 
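
The division of the vocabulary into a detection part and an expression part can be pictured with a small Python sketch. The three vocabulary entries are invented for illustration and are not taken from Artandi's report. Matching is done by simple substring search, which is cruder than any practical system would allow.

```python
# Hypothetical vocabulary: detection term (phrased as in the text)
# mapped to its expression term (the entry used in the final index).
VOCABULARY = {
    "sodium chloride": "SODIUM CHLORIDE",
    "table salt": "SODIUM CHLORIDE",
    "oxidized": "OXIDATION",
}

def index_text(text):
    """Return the expression terms whose detection terms occur in the text."""
    lowered = text.lower()
    return sorted({
        expression
        for detection, expression in VOCABULARY.items()
        if detection in lowered
    })

entries = index_text("Table salt is not easily oxidized.")
```

Here neither index entry is identical with the wording of the text, which is the case the expression part of the vocabulary is meant to handle.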

Kraft  (28)  describes  in  his  paper  a  system  claimed  to  be  the  first 
Selective Dissemination of Information (SDI) system in operation. The
system  includes  an  automatic  indexing  phase  based  on  a  similar  approach  to 
the  one  described  above.  The  punched  cards  containing  the  abstracts  are 
automatically indexed by the SDI program on the IBM 1401. The indexing
can be done in either of two ways: (1) terms may be selected from the
abstract, title, and author's name if they do not match a word on an
exclusion  list  of  common  words  stored  on  magnetic  tape;  (2)  terms  may  be 
selected  from  the  abstract,  title,  and  author's  name  if  they  match  a  word 
in a dictionary stored on magnetic tape and are not on an exclusion list
of common words. The manually-selected descriptors are also indexed by
the program. Using the dictionary approach combined with the exclusion
list, an average of 22 keywords are chosen per item. The exclusion list
technique alone indexes an item by an average of 41 keywords. It must
be  noted,  however,  that  the  requirements  for  that  system,  which  serves 
salesmen  and  system  engineers  of  the  IBM  Corporation  Midwestern  Region 
Office  in  Chicago,  Illinois,  are  not  very  sophisticated. 
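
The two indexing modes can be compared in a short Python sketch. The word lists are assumptions and are not the actual SDI tables. The function `index_terms` and its `use_dictionary` switch are hypothetical names.

```python
EXCLUSION_LIST = {"a", "and", "for", "in", "is", "of", "the"}  # assumed common words
DICTIONARY = {"abstracts", "computer", "indexing"}             # assumed term dictionary

def index_terms(words, use_dictionary=False):
    """Select index terms by exclusion list alone, or by dictionary plus exclusion list."""
    terms = []
    for word in words:
        word = word.lower()
        if word in EXCLUSION_LIST:
            continue  # mode (1) and mode (2): drop common words
        if use_dictionary and word not in DICTIONARY:
            continue  # mode (2) only: keep dictionary words
        if word not in terms:
            terms.append(word)
    return terms

words = "The computer indexing of the punched abstracts".split()
by_exclusion = index_terms(words)
by_dictionary = index_terms(words, use_dictionary=True)
```

Because the dictionary mode adds a second filter, it can never select more terms than the exclusion list alone, which is consistent with the reported averages of 22 and 41 keywords per item.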

Statistical Criteria. The statistical approach to automatic indexing
seems  to  be  the  most  promising.  Luhn,  Baxendale,  Levery,  Williams,  and 
others  have  experimented  with  this  approach.  In  most  cases,  the  first  step 
is  deletion  of  insignificant  words.  This  is  done  by  designing  a  look-up 
list  for  the  computer  which  might  include  pronouns,  articles,  conjunctions, 
conjunctive  adverbs,  copula  and  auxiliary  verbs,  quantitative  adjectives 
and  similar  words.  The  size  of  such  a  list  varies  from  100  to  700  words  for 
the  systems  reported.  Condensation  thus  achieved  ranges  from  50  to  70  per¬ 
cent  (7).  A  modified  procedure  is  to  delete  all  words  with  three  or  fewer 
characters  (6). 

At this point, one approach, originated by Luhn (32), consists of
making an absolute frequency count of the remaining words, ordering them by
descending frequency, and selecting the words within a certain frequency
range as the most significant ones (7). The justification of measuring
word  significance  by  use-frequency  is  based  on  the  fact  that  a  writer 
normally  repeats  certain  words  as  he  advances  or  varies  his  arguments  and 
as  he  elaborates  on  an  aspect  of  a  subject.  No  effort  is  made  to  differ¬ 
entiate  between  word  forms.  Luhn  argued  that  within  a  technical  discussion 
there  is  a  very  small  probability  that  a  given  word  is  used  to  reflect  more 
than  one  notion.  The  probability  is  also  small  that  an  author  will  use 
different  words  to  reflect  the  same  notion.  Even  if  the  author  makes  a 
reasonable  effort  to  select  synonyms  for  stylistic  reasons,  he  soon  runs 
out  of  legitimate  alternatives  and  falls  into  repetition  if  the  notion 
being  expressed  was  potentially  significant  in  the  first  place. 

As to the upper bound of the frequency range, Luhn proposed two
solutions. One solution would be not to set any upper limit, and to
eliminate  the  common  words,  which  can  naturally  be  expected  to  cluster  in 
the  high  frequency  region,  by  comparing  them  with  a  stored  common  word 
list.  Another  solution  is  to  determine  a  high  frequency  cutoff  through 
statistical  methods  to  establish  "confidence  limits."  Since  degree  of 
frequency  has  been  proposed  as  a  criterion,  a  lower  boundary  would  also 
be  established  to  bracket  the  portion  of  the  spectrum  that  would  contain 
the most useful range of words. The optimum locations for these cutoffs
would be established from experiments with large samples of input data.
Luhn believed that it should even be possible to adjust these locations
to  alter  the  characteristics  of  the  output.  If  non-common  words  fall 
into  the  high-frequency  region,  it  would  indicate  their  loss  of  discrimi¬ 
natory  power.  Common  words  falling  in  the  region  of  acceptable  frequency 
would  be  tolerated  because  of  their  lesser  degree  of  interference.  Thus, 
it  may  be  anticipated  that  the  cutoff  line,  once  established,  may  be  stable 
over  many  different  degrees  of  specialization  within  a  field,  or  even  over 
many  different  fields. 
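
The frequency procedure described above can be sketched in Python: delete common words, count the rest, and keep the words whose counts fall between a lower and an optional upper cutoff. The common word list and the cutoff values are assumptions. The text leaves the actual cutoffs to be settled by experiment.

```python
import re
from collections import Counter

COMMON_WORDS = {"a", "and", "in", "is", "of", "the"}  # assumed stored common word list

def significant_words(text, low=2, high=None):
    """Return (word, count) pairs with low <= count <= high, most frequent first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in COMMON_WORDS]
    counts = Counter(words)
    selected = [
        (word, n) for word, n in counts.items()
        if n >= low and (high is None or n <= high)
    ]
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))

text = "Plasma flow is turbulent. The plasma is ionized and the flow of plasma is fast."
no_upper_limit = significant_words(text, low=2)
with_upper_limit = significant_words(text, low=2, high=2)
```

Calling the function with `high=None` corresponds to the first solution (no upper limit, common words removed by list). Supplying `high` corresponds to the second, in which a high-frequency cutoff is set.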

In  the  experiments  reported  by  Luhn,  the  determination  of  this 
frequency range was arbitrary. Luhn assumed that 10 to 24 of the highest
ranking words are the most significant ones for document identification,
16 such words being the likely average.* The size of the document collection
for which this size pattern would suffice has not been determined.
Indications  are,  however,  that  size  of  collection  is  not  a  major  function 
in  determining  optimum  pattern  size. 

The refinements of this method are the "normalization" of the
list, viz. combining the terms on the list to notions by look-up in the
special thesaurus, and switching to the so-called "multi-dimensional
patterns." For the latter purpose, the automatic process would proceed
to extract from the sentences all word pairs consisting either of two
adjoining first order words or of a first order word coupled to a second
order word, the first order words marked by an appropriate sign. A record
is then developed, giving for each first order word (node) all the words
which have been found paired to it (branches).

* Baxendale (7) assumed the number of allowable words for the index as 0.5
percent of the total number of words in the article, the ones which
occurred with the highest frequency after the deletion of common words.

Instead  of  operating  with  single  words,  Meetham  (191)  investiga¬ 
ted  the  possibility  of  extracting  significant  word  pairs  and  word  groups 
for  an  automatic  generation  of  descriptor  systems  and  for  indexing.  All 
possible  pairs  of  words  were  examined  and  those  pairs  selected  which  occur 
so  frequently  in  the  same  document  (in  relation  to  their  frequencies  of 
occurring separately) that the frequency of their co-occurrence is probably
not by chance. The second step is to discover word-groups from an
examination  of  the  word-pairs.  The  words  from  which  such  groups  are  made 
are  picked  out  from  a  word  list  by  using  a  word-word  binary  matrix  to 
represent  the  association  between  pairs  of  words. 

A  relative  frequency  approach  proposed  by  Edmundson  and  Wyllys 
(23)  takes  into  account  the  fact  that,  according  to  information  theory,  a 
word's  information  value  should  vary  inversely  rather  than  directly  with 
its  frequency,  its  low  probability  evidencing  greater  selectivity,  or 
deliberation,  in  its  use.  It  is  the  rare,  special,  or  technical  word  that 
will indicate most strongly the subject of the author's discussion. Here,
however, by "rare" is meant rare in general usage, not rare within the
document itself. In other words, Edmundson and Wyllys claim that it is
wrong to treat a document as the universe of words. Rather, the frequency
of  a  word  in  a  document  should  be  compared  with  the  frequency  of  the  same 
word  in  general  use,  viz.  to  regard  the  contrast  between  the  word's 
relative  frequency  f  within  the  document  and  its  relative  frequency  r 
in  general  use  as  a  more  revealing  indication  of  the  word's  value  in 
indicating  the  subject  matter  of  document  d.  Four  types  of  significance 
functions  s(f,r)  are  proposed,  s  =  f  -  r,  s  =  f/r,  s  =  f/(f+r),  and 
s  =  log(f/r),  of  which  s  =  f  •  r  or  s  =  f/r  are  suggested  as  the  best 
choice.  According  to  the  authors,  defining  significance  in  terms  of  the 
contrast  between  frequency  in  a  document  and  in  general  usage  would  give 
low  significance  both  to  normally  rare  words  which  occur  rarely  in  the 
document  and  the  common  words  used  frequently  within  the  document  itself. 
The  relative  frequencies  are  calculated  as  follows: 

$$f_{wd} = \frac{N_{wd}}{N_d}, \qquad r_{wc} = \frac{N_{wc}}{N_c}$$

where

$N_{wd}$ is the number of occurrences of word w in document d

$N_d$ is the total number of running words in d, i.e. $N_d = \sum_w N_{wd}$

$N_{wc}$ is the number of occurrences of word w in the class of documents c

$N_c = \sum_w N_{wc}$
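
The relative frequencies and the four significance functions can be
illustrated with a minimal sketch. The function names, the labels for the
four functions, and the toy word lists are assumptions made for
illustration only; they do not come from Edmundson and Wyllys. The sketch
also assumes that every scored word has a non-zero reference frequency r.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Return N_w / N for every distinct word in a list of running words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def significance(f, r, kind="difference"):
    """Evaluate one of the four proposed functions s(f, r); r must be > 0."""
    if kind == "difference":
        return f - r
    if kind == "ratio":
        return f / r
    if kind == "share":
        return f / (f + r)
    if kind == "log_ratio":
        return math.log(f / r)
    raise ValueError("unknown significance function: " + kind)


# Hypothetical toy data: one document d and a reference class c.
document = ["laser", "laser", "the", "the"]
reference = ["the"] * 8 + ["laser"] * 2

f = relative_frequencies(document)
r = relative_frequencies(reference)
scores = {w: significance(f[w], r[w]) for w in f}
```

In this toy case the technical word receives a positive score and the
common word a negative one, which is the contrast the method relies on.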

A further refinement of the process of automatic analysis would
be the development of special sets of reference frequencies for special
fields of interest. Two benefits are claimed for this: it would become
possible to classify documents as to field, and it would become possible
to note the significance of words which are frequent in a very large
reference class $c_0$ of literature (i.e. these words would be significant
with respect to $c_0$) but which are rare in the special field.

To demonstrate how this method would operate, assume that the
relative frequencies of m words have been established, both for a large
reference class $c_0$ of literature and also for n special fields of interest
$c_j$, j = 1, 2, ..., n. Thus, there would be n + 1 values of relative
frequency for each word w, where w runs from 1 to m, and where

$r_{w0}$ = relative frequency of word w with respect to the class $c_0$
literature

$r_{wj}$ = relative frequency of word w with respect to special field $c_j$.

Next, the m x (n + 1) matrix $(r_{wj})$ is formed, each column of
which contains the frequencies of all the listed words for a particular
field (the whole body of literature being represented in the first column)
and each row of which contains the frequencies of a particular word in all
the listed fields.

The automatic indexing would then proceed as follows: first, the
determination of the words that are significant with respect to general
literature by the comparison of the relative frequencies $f_{wd}$ of words in
d with the relative frequencies in the first column of the matrix $(r_{wj})$;
second, the comparison of the document's frequencies with the other columns
in the matrix in order to determine which column forms the "best fit" with
the document; and third, the determination of the words that are significant
with respect to the special field. One standard method for determining the
"best fit" would be to find the column j whose frequencies differ least
from those of the document. Once frequency-ordered indexes have been
established for various subject-fields the automatic index of any new
document can be compared with them by machine processes. According to the
authors, the results of the comparison would determine, first, the subject
field to which the document properly belongs (classification); second, other
subject-fields with which it should be associated (cross-reference); and
finally, those terms which are significant enough to be used as
identification tags for the process of recovering the document (retrieval).

As an extension of the relative frequency approach, involving
syntactic and semantic approaches, the authors propose the introduction
of weighted frequency. The machine can be instructed to recognize the
title by position and capitalization and to place a "title indication"
after each word appearing in the title as it compiles its list. Similarly,
it can place "first-paragraph indications" after all words it meets until
it recognizes the end of the first paragraph. Every heading or sub-title
can be tested for the words "summary" or "conclusions," and the machine
can place a "summary indication" after each word in the summary
paragraphs. At the conclusion of its "reading" of the article, the machine
can compute each word's weighted significance S according to the formula:

$$S = B_t \, B_p \, B_s \, s(f,r),$$

where for a given word w,

$B_t = b_t$ if w bears a title indication, 1 otherwise

$B_p = b_p$ if w bears a first-paragraph indication, 1 otherwise

$B_s = b_s$ if w bears a summary indication, 1 otherwise

and where $b_t$, $b_p$, and $b_s$ are preassigned weights, all greater than one, for
occurrence in title, first paragraph, and summary, respectively.
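
A minimal sketch of the weighting follows, assuming that each indication
multiplies the base significance s(f,r) by its preassigned weight. The
default weight values are hypothetical; the text requires only that they
be greater than one.

```python
def weighted_significance(s, in_title=False, in_first_paragraph=False,
                          in_summary=False, b_t=2.0, b_p=1.5, b_s=1.5):
    """Multiply the base significance s by a weight for each indication borne.

    b_t, b_p, b_s are illustrative values only; each must exceed one.
    A word without a given indication contributes a factor of 1.
    """
    factor_t = b_t if in_title else 1.0
    factor_p = b_p if in_first_paragraph else 1.0
    factor_s = b_s if in_summary else 1.0
    return factor_t * factor_p * factor_s * s


plain = weighted_significance(0.3)
title_only = weighted_significance(0.3, in_title=True)
everywhere = weighted_significance(0.3, in_title=True,
                                   in_first_paragraph=True, in_summary=True)
```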


Alternatively, statistical methods of this type might be used as
preliminary sorting for later application of non-statistical criteria. For
example, when a word already known to be somewhat significant by statistical
methods also occurs in the title, its significance might be taken as
guaranteed, and the machine program could recognize the fact by placing
it on the "definitely significant" list, even though the word was outranked
in significance by other words. Recapitulating, the final selection of
significant words would be based on three criteria: (1) significance of
the word with respect to general literature, (2) significance of the word
with respect to a specialized field, and (3) placement of the word on a
"definitely significant" list. Under criteria 1 and 2 there would be an
alternative of selecting either all words whose significance value
exceeded a predetermined threshold value s, or only the first n words in
order of significance from the highest down, adding, in either case, those
words selected by criterion 3.
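
The two selection alternatives can be sketched for a single table of
significance values. The function name and the sample scores are
hypothetical. In the scheme described above the same selection would be
applied once to the general-literature scores and once to the
special-field scores.

```python
def select_index_words(scores, definite=(), threshold=None, top_n=None):
    """Select words by threshold or by rank, then add the definite list.

    scores:   mapping word -> significance value
    definite: words on the "definitely significant" list (criterion 3)
    Exactly one of threshold or top_n must be given.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("give exactly one of threshold or top_n")
    if threshold is not None:
        chosen = {w for w, s in scores.items() if s > threshold}
    else:
        ranked = sorted(scores, key=scores.get, reverse=True)
        chosen = set(ranked[:top_n])
    return chosen | set(definite)


sample = {"laser": 0.9, "beam": 0.5, "study": 0.1}
by_threshold = select_index_words(sample, definite=["study"], threshold=0.4)
by_rank = select_index_words(sample, definite=["study"], top_n=1)
```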

A somewhat similar but simplified technique is described by
F. Levery (30) of the International Business Machines Corp. in France.
Non-common words are first combined to form notions with the help of a
dictionary of synonyms, and the frequency of the notions is counted for
selection of significant terms. Two criteria were applied for the selection
of keywords: (1) the frequency of appearance of the notion must exceed the
average frequency of all notions in the text studied, and (2) the frequency
of appearance of the word must exceed the average frequency in the entire
collection. The experiments were conducted on French language texts
dealing with the manufacture and study of glass. Thirty documents were
machine indexed, each document being 200 to 600 words long. The total
number of words was 10,721. The deleted list for the whole collection
consisted of 668 words, which appeared 6,589 times and thus accounted for
61.4 percent of the words present. The 1,681 different non-common words
found in the collection were grouped into 897 notions. The 30 most frequent
notions accounted for over one-fourth of the non-common words appearing
(4,132). The input processing was done on an IBM 7094 computer which
supplied for each document a word list in alphabetic order and another
list in order of frequency.

The technique for selecting significant words, proposed by
Oswald (47), has the following main features: (1) Insignificant, viz.
common, words are deleted and only words that are significant in the context
of the document are retained. (2) The retained words are frequency counted.
(3) Next, every juxtaposition (of two or more words) involving a
high-frequency word is recorded as a significant word group. The recording of
such groups begins with those that contain the single word of highest
frequency and continues until six successive Uniterm words, in order of
descending frequency on the Uniterm frequency list, produce either no
significant groups or no new significant groups. This rule produces
auto-indexes whose lengths, although differing, usually lie within the limits
of 1 to 3 percent of the total vocabulary of any given article.

Finally, special consideration should be given to the text
condensation and index editing method that consolidates conceptually related
words which are spelled in the same way at their beginning, such as elliptic
and ellipticity. The procedure proposed by Luhn (32) is a statistical analysis
routine consisting of a letter-by-letter comparison of pairs of succeeding
words in the alphabetized list. From the point where letters failed to
coincide a combined count was taken of the non-similar subsequent letters
of both words. When this count was six or below, the words were assumed
to be similar notions; above six, different notions. Although this method
of word consolidation is not infallible, errors up to 5 percent did not
seem to affect the final results.
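
The comparison can be sketched as follows. The sketch assumes that all
letters after the first point of disagreement are counted in both words;
the function names and the sample words are illustrative and are not
taken from Luhn's routine.

```python
def mismatch_count(a, b):
    """Count the letters of both words that follow their common beginning."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return (len(a) - i) + (len(b) - i)


def same_notion(a, b, limit=6):
    """Treat two words as one notion when the combined count is six or below."""
    return mismatch_count(a, b) <= limit


def consolidate(words, limit=6):
    """Group an alphabetized word list by comparing each word with its predecessor."""
    groups = []
    for word in sorted(words):
        if groups and same_notion(groups[-1][-1], word, limit):
            groups[-1].append(word)
        else:
            groups.append([word])
    return groups


groups = consolidate(["ellipticity", "magnet", "elliptic", "magnetic"])
```

Because the rule looks only at letter counts, two short unrelated words
can also fall at or below the limit, which is one source of the errors
mentioned above.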


1.2.2. Indexing by Assignment


This  type  of  indexing  presupposes  categorization  or  classification 
of  documents  as  the  first  step  in  the  selection  of  indexing  terms.  Various 
approaches  to  automatic  document  categorization  will  be  briefly  surveyed 
here. 


Maron's (36) method starts with the statistical selection of cue words
from a sample population of documents previously assigned to certain
categories by human indexers. The complete corpus consisted of 405 different
documents and was divided into two groups. Group 1 contained 260 abstracts
which appeared in the March and June issues of the 1959 IRE Transactions on
Electronic Computers, and was the basis for the statistical data necessary
to make the subsequent predictions. Group 2 consisted of 145 abstracts which
appeared in the September 1959 issue of the Transactions and was used to
test the system.

A classification system of 32 categories was created similar to,
but not identical with, the classification system used in the IRE
Transactions, and each one of the 260 documents of Group 1 was carefully read
and "sorted" into one or more of the categories. In the majority of
instances a document was indexed under a single category, but in about 20
percent of the cases a document was indexed under two categories, and in
only a few cases under three categories. The highest number of documents
in a single category was 37, and the lowest was 2.

Next, every word in each of the documents of Group 1 was
keypunched. There was a total of over 20,000 word occurrences with an average
of 79 words per document, and a total of 3,263 different words. The 55
most frequently occurring logical type, viz. common, words (e.g. the, of, a,
etc.) accounted for 8,402 of the total (20,515) occurrences. Thus, less
than 2 percent of the words accounted for over 40 percent of the total
occurrences. They were rejected as candidates for cue words.

The  most  frequently  occurring  non-common  words  were  considered 
next. This list contained words such as "computer," "system," "data,"
"machine,"  etc.  They  also  were  rejected  as  possible  cue  words  because  it 
was  felt  that  they  had  little  discriminating  power  to  be  cues  for  the 
specification  of  subject  content  within  the  general  field  of  computers. 

Of the total 3,263 different words, 2,120 or 65% occurred less than three
times in the 260 documents. They were also rejected as possible cue words
because they were too specific (provided they were indicative of the
contents of the document at all). This left just over 1,000 different words
with neither a very high nor very low relative frequency of occurrence. A
listing was made showing the number of times each of these 1,000 words
occurred in the documents belonging to category 1, category 2, etc. Each
word on the list was checked to determine whether or not it "peaked" in
any of the 32 categories. If a word did peak it was felt that the word
would be a good cue. If the distribution was flat for a given word, then
it was rejected. An attempt was made to find at least one word to peak in
each of the 32 categories. In this way, 90 different words were finally
selected as cue words.

Then the problem was conceived as follows: Given that a document,
say D, contains one or more cue words $W_i$, what is the probability that D
belongs to each of the categories $C_1$, $C_2$, $C_3$, and so on. Maron used the
well known Bayes prediction equation to calculate these probabilities. For
one cue word $W_i$, the equation is:

$$P(C_j \mid W_i) = \frac{P(C_j) \cdot P(W_i \mid C_j)}{P(W_i)}$$

$P(C_j)$ is the so-called a priori probability that a document will be indexed
under the j-th category and $P(W_i \mid C_j)$ is the probability that if a document
is indexed under the j-th category it will contain word $W_i$. For any $W_i$,
the denominator $P(W_i)$ is a constant and hence the equation may be rewritten
as follows:

$$P(C_j \mid W_i) = k \cdot P(C_j) \cdot P(W_i \mid C_j)$$

where k is a scaling factor. The value of $P(C_j)$ is estimated by counting
the number of index entries that are made under the j-th category and
dividing this by the total number of index entries. The values of
$P(W_i \mid C_j)$ are estimated by counting the number of occurrences of the i-th
word which belong to documents that were indexed under the j-th category
and dividing through by the total number of cue word occurrences in all
documents belonging to the j-th category.

In the general case where a document contains different cue words,
$W_k$, $W_m$, ..., $W_s$, the probability that the document belongs to the j-th
category is computed as follows:

$$P(C_j \mid W_k, W_m, \ldots, W_s) = k \cdot P(C_j) \cdot P(W_k \mid C_j) \cdot P(W_m \mid C_j) \cdots P(W_s \mid C_j)$$

The values of the left hand side of the above equation are called "attribute
numbers." Thus, 32 attribute numbers are obtained for each document, one
for each of the 32 categories.
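
The estimation and prediction steps can be sketched as follows. The
function names and the two-category sample are hypothetical, and the
scaling factor k is chosen here so that the attribute numbers of a
document sum to one, which is an assumption rather than a detail reported
for Maron's program.

```python
from collections import Counter, defaultdict


def train(labeled_docs, cue_words):
    """Estimate P(C_j) and P(W_i | C_j) from (tokens, categories) pairs."""
    entries = Counter()                 # index entries made under each category
    word_counts = defaultdict(Counter)  # cue-word occurrences per category
    for tokens, categories in labeled_docs:
        for c in categories:
            entries[c] += 1
            for t in tokens:
                if t in cue_words:
                    word_counts[c][t] += 1
    total_entries = sum(entries.values())
    prior = {c: n / total_entries for c, n in entries.items()}
    likelihood = {}
    for c in entries:
        total = sum(word_counts[c].values())
        likelihood[c] = {w: (word_counts[c][w] / total if total else 0.0)
                         for w in cue_words}
    return prior, likelihood


def attribute_numbers(tokens, prior, likelihood, cue_words):
    """Return one attribute number per category, or {} if no cue word occurs."""
    present = {t for t in tokens if t in cue_words}  # different cue words only
    if not present:
        return {}
    raw = {}
    for c in prior:
        score = prior[c]
        for w in present:
            score *= likelihood[c][w]
        raw[c] = score
    total = sum(raw.values())
    if total == 0:
        return raw
    return {c: v / total for c, v in raw.items()}  # k = 1 / total (assumption)


cues = {"memory", "logic"}
sample = [
    (["memory", "core", "memory"], ["storage"]),
    (["logic", "gate", "memory"], ["circuits"]),
]
prior, likelihood = train(sample, cues)
numbers = attribute_numbers(["memory", "unit"], prior, likelihood, cues)
```

A document with no cue words yields no attribute numbers, which
corresponds to the documents that could not be indexed automatically in
the experiment.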


It  turned  out  that  in  the  initial  group  of  260  documents,  12 
documents  contained  none  of  the  90  cue  words,  and  hence  no  automatic 
indexing  was  possible  for  these  12  documents.  Also  there  was  an  error 
preventing  one  of  the  remaining  documents  from  being  automatically  indexed. 
This  left  247  documents.  In  209  of  the  247  cases  (84.6%),  the  category 
with  the  greatest  attribute  number  in  each  output  list  was  a  correct 
category.  If  the  document  had  at  least  two  cue  words,  then  the  probability 
that  the  category  with  the  greatest  attribute  number  is  a  correct  one  was 
91  percent.  In  Group  2,  which  was  the  new  input  to  be  tested,  of  a  total 
of  145  documents,  20  contained  no  cue  words,  and  40  contained  only  one 
cue  word.  This  left  85  documents,  each  containing  at  least  two  different 
cue words. In 44 (51.8%) of these 85 cases the machine printed the correct
category at the top of the output list, i.e., the category with the greatest
attribute number was the correct category. The probability that the machine
will print out the correct category in one of the first three positions
was 80 percent.

A modified approach to evaluate the "goodness" of the cue words
was proposed by Trachtenberg (59). It involves calculating for each potential
predictor or cue word (a) the non-correlation factor of word occurrence
category, or the uncertainty of category given the occurrence of a word $W_i$
in a document,

$$H_i = -\sum_j p_{ij} \log p_{ij}, \qquad 0 \le H_i \le \log k,$$

where $p_{ij}$ is the probability that a document with the word $W_i$ falls into
the category $C_j$, and (b) a special measure involving the log of the ratio
of the a posteriori to the a priori probability, viz.

$$M_i = \sum_j p_{ij} \log \frac{p_{ij}}{p_j}$$

A word that has a high value for $M_i$ and a low value for $H_i$ would be selected
as the cue word.
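
Both measures can be computed directly from the category probabilities.
The sketch below uses natural logarithms and hypothetical function names;
terms with zero probability are skipped, following the usual convention
that 0 log 0 = 0.

```python
import math


def uncertainty(p_given_word):
    """H_i: uncertainty of category given that the word occurs."""
    return -sum(p * math.log(p) for p in p_given_word if p > 0)


def gain_over_prior(p_given_word, prior):
    """M_i: expected log ratio of a posteriori to a priori probability."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_given_word, prior) if p > 0)


prior = [0.5, 0.5]
flat_word = [0.5, 0.5]    # occurs equally in both categories
peaked_word = [1.0, 0.0]  # occurs in one category only

flat = (uncertainty(flat_word), gain_over_prior(flat_word, prior))
peaked = (uncertainty(peaked_word), gain_over_prior(peaked_word, prior))
```

The peaked word has zero uncertainty and the larger gain, so it would be
preferred as a cue word.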


Similar procedures were proposed to treat word frequency
information. The corresponding equations are:

$$H_i(f_s) = -\sum_j p_{ij}(f_s) \log p_{ij}(f_s)$$

$$M_i(f_s) = \sum_j p_{ij}(f_s) \log \frac{p_{ij}(f_s)}{p_j}$$

where $f_s$ is a range of values of the relative frequency of a word in a
document (its occurrences relative to the total number of words in that
document), and $p_{ij}(f_s)$ is the probability that the document falls in
category $C_j$ given that the relative frequency of word $W_i$ in the document
is in the interval $f_s$. No testing of the proposed method was made.

Borko  (11)  proposed  a  method  which  uses  "factor  loadings"  of  terms 
as  probability  measures  for  determining  the  category  to  which  a  document 
belongs.  Briefly,  his  approach  is  as  follows.  Six  hundred  and  eighteen 
psychological  abstracts  were  coded  in  machine  language  for  computer  processing. 
The  total  text  consisted  of  approximately  50,000  words,  of  which  nearly  6,800 
were  unique.  The  computer  program  arranged  these  words  in  order  of  frequency 
of  occurrence.  From  the  list  of  words  which  occurred  20  or  more  times, 
excluding syntactical terms such as and, but, of, etc., the investigator
selected  90  words  for  use  as  index  terms.  These  were  arranged  in  a  data 
matrix  with  the  terms  on  the  horizontal  and  the  document  number  on  the 
vertical  axis;  the  cells  contained  the  number  of  times  the  term  was  used 
in  the  document.  Based  on  these  data,  a  correlation  matrix,  90  by  90  in 
size,  was  computed  which  showed  the  relationship  of  each  term  to  every  other 
term.  To  compute  the  correlation  coefficient  from  raw  score  data  (Document- 
Term  Matrix),  the  following  formula  was  used: 

$$r_{xy} = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$$

where N = total number of documents, and X and Y are terms being correlated.
A computer program for calculating these correlations was written by the
System Development Corp.

There are a number of methods for estimating the commonality.
The simplest procedure would be to choose the highest correlation coefficient
from among the other correlations in that set. By grouping together the
related terms, a classification system for the given corpus of documents could
be derived. However, this is not a task that can be done by inspection. In
the 90 by 90 matrix, which is symmetrical, there are 4,005 correlations. In
order to analyze the data in a precise fashion, Borko employed the technique
of factor analysis.
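
The raw-score correlation between two terms, and the resulting term-term
matrix, can be sketched as follows. The function names and the small
document-term matrix are illustrative and do not reproduce the original
program.

```python
import math


def term_correlation(x, y):
    """Correlation of two terms from their per-document counts (raw-score formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("a term with constant counts has no defined correlation")
    return (n * sum_xy - sum_x * sum_y) / denominator


def correlation_matrix(doc_term):
    """doc_term: one row per document, one column per term."""
    columns = list(zip(*doc_term))
    return [[term_correlation(a, b) for b in columns] for a in columns]


# Hypothetical counts of two terms in three documents.
matrix = correlation_matrix([[1, 2], [2, 4], [3, 6]])
distinct_pairs = 90 * 89 // 2  # off-diagonal entries of a symmetric 90 x 90 matrix
```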

The purpose of factor analysis is to reduce the original correlation
matrix to a smaller number of factors. A factor corresponds to an
eigenvector. The size of the eigenvector, i.e., the eigenvalue, is equal
to the contribution to the variance made by that factor. The first
eigenvector, or factor, accounts for a relatively large proportion of
information and each succeeding factor accounts for less. In factor analysis
it is not necessary to account for the total variance of the correlation
matrix, for it is known that a certain proportion of the variance is unique
or specific to the given set of documents in the experimental situation.
Only the common variance is of interest, viz. that portion of the
variance that is due to the relationship among the terms and which would
continue to be true for all sets of documents. The problem, of course, is
to determine the proportion of the total variance which is common.

Based on the above considerations, the matrix was factor analyzed
and the first ten eigenvectors were selected as factors. These were
rotated for meaning and interpreted as major categories in a classification
system. These factors were compared with, and shown to be compatible with
but not identical to, the classification system used by the American
Psychological Association.


An approach similar to that chosen by Maron is
reported by Williams of the IBM Corporation (61). He also proposed a
discriminant coefficient to identify significant words. This discriminant
coefficient is a function of the relative frequencies of the i-th word in
the j-th category.

These coefficients are calculated from the data obtained from a
small set of reference documents previously classified into categories
(hierarchical classification structure assumed) by human indexers.* The
discriminant coefficients thus computed are used to set up discriminant
thresholds determining which words will be used in the classification
equation and to assign weighting factors to the words themselves. The
computer program categorizes documents by comparing the observed with the
theoretical word frequencies and computing a Relevance Value (RV) for each
document with respect to each category. The RV equation is


where $p_{io}$ is the relative observed frequency in the document, $p_{ij}$ is the
relative theoretical frequency of the i-th word in the j-th category after
transformation to document size, and m is the number of word types in the
group. Documents that show the highest RV for a particular category are
classified accordingly. Those documents having an RV outside the standard
deviation limits would be returned for re-evaluation.

* For the experiment, 400 computer abstracts prepared and published by
Cambridge Communications Corp. were selected. Each of the abstracts
was classified by CCC in their normal operation. Three hundred of the
400 abstracts were used as reference documents, and were equally divided
among the 20 categories of the classification system. The remaining 100
were used as the test documents. The objective of the experiment was to
classify the 100 test documents into their correct categories.

A somewhat simplified approach was taken by Stevens (55) of the
National Bureau of Standards. The SADSACT (Self-Assigned Descriptors from
Self and Cited Titles) method correlates descriptors or indexing terms with
significant words in a representative sample of the population of
documents to be indexed; viz., each significant word in the title and in the
abstract of the document is associated with each of the descriptors
previously assigned to that document. Descriptors that occurred three or more
times in the 100-item sample were retained as "validated descriptors." For
the validated descriptors, the word-descriptor association lists were then
merged into a master vocabulary list which showed for each word the
descriptors with which it co-occurred and the relative frequencies of its
co-occurrence with each descriptor.

Thus, the SADSACT automatic indexing method used an ad hoc statistical
association technique in which each word may be associated either
appropriately or inappropriately with a number of different descriptors. The
indexing procedure was carried out as follows. The text of the title of
a new item and of titles cited as bibliographic references by the author
was keypunched, and the byproduct punched paper tape was converted to cards
for input to the computer. This input material was processed against the
master vocabulary list to yield, for each word that matched a word in the
vocabulary, a "descriptor-selection-score" value for each of the descriptors
previously associated with that word. After all words from titles and cited
titles were processed, the descriptor scores were summed and, for some
appropriate cutoff level, those descriptors having the highest scores were
assigned to the new item.

The actual score value includes both a normalizing factor (based,
for example, on a ratio of the number of previous co-occurrences of this
word with a particular descriptor to the number of different words
co-occurring with that descriptor) and a weighting formula that gives greater
emphasis to words occurring in the "self-title" (the author's own choice of
terminology) than to those occurring in cited titles. Similarly, greater
emphasis is given to words that coincide with the names of descriptors.
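
The scoring procedure can be illustrated by a rough sketch. Only the
general shape follows the description above: a normalized co-occurrence
ratio, summed over title words, with extra weight for self-title words
and for words that coincide with a descriptor name. The function names,
the weight values, and the sample data are assumptions, and the actual
SADSACT formula may differ.

```python
from collections import defaultdict


def build_vocabulary(sample):
    """sample: (words, descriptors) pairs -> co-occurrence counts per word."""
    cooc = defaultdict(lambda: defaultdict(int))
    for words, descriptors in sample:
        for w in set(words):
            for d in descriptors:
                cooc[w][d] += 1
    return cooc


def descriptor_scores(self_title, cited_titles, cooc,
                      self_weight=2.0, name_bonus=2.0):
    """Sum normalized word-descriptor scores; the weights are illustrative."""
    breadth = defaultdict(int)  # different words co-occurring with a descriptor
    for w, descriptors in cooc.items():
        for d in descriptors:
            breadth[d] += 1
    weighted = [(w, self_weight) for w in self_title]
    for title in cited_titles:
        weighted += [(w, 1.0) for w in title]
    scores = defaultdict(float)
    for w, weight in weighted:
        for d, n in cooc.get(w, {}).items():
            bonus = name_bonus if w == d else 1.0
            scores[d] += weight * bonus * n / breadth[d]
    return dict(scores)


def assign(scores, cutoff):
    """Return the descriptors whose summed score reaches the cutoff."""
    return sorted(d for d, s in scores.items() if s >= cutoff)


sample = [
    (["computer", "memory"], ["storage"]),
    (["memory", "core"], ["storage"]),
    (["logic", "circuit"], ["circuits"]),
]
vocabulary = build_vocabulary(sample)
scores = descriptor_scores(["memory", "design"], [["logic", "memory"]], vocabulary)
assigned = assign(scores, cutoff=1.0)
```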

Baker (67) recognized the similarity between document classification
and the problems inherent in the analysis of sociological questionnaire data
and proposed a classification method based upon Lazarsfeld's (Stouffer)
latent class analysis. Briefly, the latent class model assumes that the
population - that is, the documents in the sample - can be divided
into a number of mutually exclusive classes. Usually the number of classes
is determined by the investigator, although it is conceivable that this
parameter can be determined mathematically. One starts by selecting the
key words which characterize each class of documents. Then latent class
analysis is used to compute the probability that a document having a certain
pattern of key words belongs to a given class. For instance, assume that
there are 1,000 documents in a file. These documents are to be classified
into two classes - those dealing with computer automated instruction and
those not directly related to this topic. The following key words are
selected in the search request:

1.  computer 

2.  automated 

3.  teaching 

4.  devices 

Each of the 1,000 documents is then analyzed to determine whether it
contains one or more of the four terms. Sixteen ($2^4$) response patterns are
possible, ranging from ++++ to 0000. A chi-square test enables one to
estimate the latent structure from the observed data. Having obtained
a latent structure which fits, one can compute an ordering ratio, which is
the probability that a document having a given word pattern belongs to a
particular latent class. For example, a document with all four key words
present has a probability of .998 of belonging to class 1, i.e., it is
concerned with computer automated instruction. The method seems to have
merit, but no experiments were actually made to test it.
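
Once a latent structure is available, the ordering ratio follows from
Bayes' rule under the latent class assumption that key words occur
independently within a class. The sketch below takes the structure as
given; the class proportions and word probabilities are hypothetical
values chosen for illustration, and the fitting step with its chi-square
test is omitted.

```python
from itertools import product


def ordering_ratio(pattern, class_priors, word_probs, target=0):
    """Probability that a document with this key word pattern is in the target class.

    pattern:      tuple of 1 (key word present) or 0 (absent), one entry per key word
    class_priors: proportion of documents in each latent class
    word_probs:   for each class, the probability that each key word is present
    """
    joint = []
    for prior, probs in zip(class_priors, word_probs):
        likelihood = prior
        for present, p in zip(pattern, probs):
            likelihood *= p if present else (1.0 - p)
        joint.append(likelihood)
    return joint[target] / sum(joint)


# Hypothetical two-class structure for the four key words.
priors = [0.2, 0.8]
probs = [
    [0.9, 0.8, 0.9, 0.7],  # class 1: computer automated instruction
    [0.3, 0.1, 0.1, 0.2],  # class 2: other topics
]
patterns = list(product((1, 0), repeat=4))  # the sixteen response patterns
ratios = {p: ordering_ratio(p, priors, probs) for p in patterns}
```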

Obviously,  once  the  document  is  delegated  to  a  specific  class  or 
category  by  one  of  the  above  described  methods,  indexing  terms  or  terms 
identifying  the  contents  of  that  class  can  be  tagged  or  assigned  to  the 
document. 




1.3.  MACHINE  INDEXING  EVALUATION 

No absolute standards have been as yet discovered for machine
indexing  evaluation  and  measuring  its  "goodness"  just  as  there  are  no 
standards  and  absolute  measures  of  "goodness"  of  human  indexing.  Therefore 
some  authors  represent  the  viewpoint  that  until  such  standards  and  measures 
are  discovered,  if  they  can  be  discovered  at  all,  only  relative  or  indirect 
evaluation  is  possible  by  comparing  a  particular  method  of  machine  indexing 
with  other  operational  systems,  human  or  machine.  Thus,  there  are  two 
possibilities:  (1)  comparing  machine  indexing  with  human  indexing  and 
(2)  comparing  one  machine  indexing  method  with  another. 

Most  investigators  have  attempted  to  compare  machine  indexing  with 
human  indexing  and  less  has  been  done  in  comparing  machine  indexing  vs 
machine  indexing.  The  reason  for  this  might  be  that  so  far  there  are  only 
a  few  experimental  automatic  indexing  systems  being  operationally  tested 
and  there  is  very  little  data  on  their  actual  performance. 

Another suggested approach to the evaluation problem is to
determine the quality of indexing by evaluating the quality of retrieval.
Meetham (191) indicates in his report on the proposed automatic indexing
system that the evaluation of the system on 53 inquiries in a sample
collection of documents produced an overall relevance ratio up to 0.33
and an overall recall ratio up to 0.38. Unfortunately, very little was
reported on the test methodology and systems parameters. Such data would
have greatly enhanced the value of this pioneering effort.

Another  small  scale  experiment,  containing  elements  of  this 
approach,  is  described  by  Swanson  (56).  A  collection  of  100  articles  was 
chosen  as  an  experimental  library,  and  each  article  in  the  collection  was 
studied  in  the  light  of  its  possible  relevance  to  each  of  50  questions 
asked.  All  the  articles  were  on  nuclear  physics.  Furthermore,  in  order 
to  compare  the  effectiveness  of  text  searching  by  computer  with  more  or 
less  conventional  methods,  the  experimental  collection  of  articles  was 
catalogued  by  means  of  a  subject  heading  index  designed  for  this  particular 
field of science. Three methods of retrieval were employed: (1) "conventional
retrieval" based on the subject heading index with no machine procedures
involved; (2) retrieval based on specifications of words and phrases
in disjunctive and conjunctive combinations without any other retrieval
aids; (3) search requests formulated as described in the second case but
with the thesaurus-like word and phrase group list and the index thereto
as retrieval aids. The results in terms of "per cent of relevant material
retrieved averaged over all requesters and all questions" were reported
as follows: Test One - 38 percent; Test Two - 68 percent; Test Three - 86
percent.


No other practical studies in evaluating automatic indexing by
retrieval efficiency besides these two limited size experiments are known.
The theoretical possibility of such an evaluation and the implications
of this method are discussed by O'Connor in his paper Mechanized Indexing
Methods and Their Testing (44), which also covers a wide range of other
problems related to machine indexing.

The problem of relevance, which involves highly subjective criteria
and  therefore  is  hardly  accessible  to  formalization,  can  be  avoided  by 
taking  a  strictly  formal  approach  to  index  evaluation.  In  the  case  of 
human  indexing,  this  would  presuppose  that  the  choice  of  indexing  terms  by 
one  indexer  is  as  good  as  by  any  other,  provided  of  course,  that  the  indexers 
are  qualified  specialists.  Thus  the  choice  of  the  indexing  terms  is  accepted 
by  the  user  at  their  "face  value"  within  certain  confidence  limits,  which 
are  set  by  the  variance  of  indexers  in  the  selection  of  terms  to  tag  a 
particular  book  or  document.  It  is  then  up  to  the  user,  reference  librarian 
or  the  information  systems  specialist  to  make  the  best  use  of  the  tools  the 
indexer  gives  him  to  obtain  maximum  efficiency  from  the  system  subject  to 
known  limitations.  The  evaluation  problem  thus  becomes  a  problem  of  a 
formal  evaluation  of  the  system  as  a  communication  channel,  which  on  this 
basis  is  entirely  accessible  to  mathematical  analysis. 

Extending  this  approach  to  automatic  indexing,  and  in  particular 
to  indexing  by  extraction,  we  assume  that  the  author  of  a  book  or  document 
is  competent  enough  to  express  the  subject  matter  in  pertinent  words  and 
that his choice of words is therefore accepted without questioning. All
the machine does is to eliminate from the author's text words which
carry  no  information  for  any  user,  and  to  condense  the  meaningful  terms 
to  keep  the  index  size  within  tolerable  limits.  Here  again,  the  efficiency 
coefficient  of  information  transmission  would  be  the  only  formal  measure 
in  evaluating  a  given  system. 

1.3.1.  Comparing  Machine  and  Human  Indexing 

Our  attempt  to  evaluate  machine  indexing  by  comparing  it  with  a 
"well-reputed"  human  indexing  system  was  part  of  a  wider  study  on  machine 
indexing  and  abstracting  efficiency  by  Kurmey  (29).  The  source  data  was 
obtained  by  selecting  50  abstracts  at  random  from  Chemical  Abstracts,  10 
each for the years 1951 through 1955, and tracing the abstracts to the
original articles. These articles were then machine indexed, the only
criterion of significance being the frequency of word appearance after the
deletion of common words. The total number of words in all these articles was
131,283; the number of different words (excluding common words), 21,200;
the number of common words, 3,362; and the average word frequency, 5.3. The
predetermined  frequency  cutoff  was  obtained  by  dividing  the  list  at  the 
closest  word  frequency  group  corresponding  to  the  number  of  terms  assigned 
by  Chemical  Abstracts.  The  machine  index  so  created  was  then  used  for 
direct  comparison  with  the  index  entries  assigned  to  the  same  article  by 
Chemical  Abstracts.  In  addition,  the  word  frequency  list  for  each  article 
was  manually  scanned  to  determine  if  the  terms  assigned  by  Chemical  Abstracts 
were  at  all  present  in  the  machine  derived  list. 


Analysis  of  the  index  terms  assigned  by  Chemical  Abstracts  was 
carried  out  manually  prior  to  comparison  with  the  machine  index  terms.  For 
each  article,  the  Chemical  Abstract  entries  consisting  of  two  or  more  words 
were  broken  into  single  word  entries. 

Comparison  of  the  index  words  was  carried  out  manually  with  two 
different  approaches.  In  the  first  approach,  the  entire  alphabetized  non¬ 
common  word  list  of  an  article  was  scanned  to  see  if  the  word  used  by 
Chemical  Abstracts  was  in  the  text  of  the  article.  The  agreement  between 
the  Chemical  Abstracts  words  and  the  word  list  was  taken  on  a  straight 
percentage  basis.  In  the  second  approach,  the  number  of  words  used  by 
Chemical  Abstracts  was  used  as  a  cutoff  to  obtain  words  with  the  highest 
frequencies. The agreement was also taken on a straight percentage basis.
The percentages use, as a base, the number of words in the Chemical Abstracts
entries.

The  average  overall  conformity  between  the  alphabetized  noncommon 
word  list  and  Chemical  Abstracts  entries  was  found  to  be  81.76  percent.  The 
average  overall  conformity  between  the  subset  of  words  of  highest  frequency 
and  Chemical  Abstracts  entries  was  27.63  percent. 
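
The two comparison approaches can be sketched as follows. This is an illustration under stated assumptions, not Kurmey's program: the function name and sample data are hypothetical, and the cutoff is a plain top-k selection, whereas the study divided the list at the closest word frequency group.

```python
from collections import Counter


def conformity(text_words, human_terms, stoplist):
    """Return (full_list_agreement, cutoff_agreement) as fractions.

    The base of both fractions is the number of distinct single words
    in the human index entries, as in the study described above.
    """
    counts = Counter(w for w in text_words if w not in stoplist)
    human = set(human_terms)
    if not human:
        return 0.0, 0.0
    # Approach 1: is each human index word anywhere in the noncommon word list?
    full = sum(1 for term in human if term in counts) / len(human)
    # Approach 2: keep only as many top-frequency words as the human index has.
    # Simplification: plain top-k, not the frequency-group cutoff of the study.
    top = {word for word, _ in counts.most_common(len(human))}
    cutoff = len(human & top) / len(human)
    return full, cutoff


# Hypothetical article text and human index words.
words = "the acid with salt the acid base salt acid base water".split()
full, cutoff = conformity(words, ["acid", "water", "chloride"], {"the", "with"})
print(f"full list: {full:.0%}, frequency cutoff: {cutoff:.0%}")
```

In this toy case the human word "water" is present in the text but falls below the frequency cutoff, which is the same effect that separates the 81.76 percent and 27.63 percent figures.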

It  is  apparent  that  "maximum-depth"  indexing  would  cover  most 
(81.76%)  of  the  entries  used  in  the  Chemical  Abstracts  indexes  for  the 
articles.  Most  of  the  indexing  terms  used  by  Chemical  Abstracts  appear  in 
the article, hence the high agreement for the alphabetized noncommon words.
However, the most frequently occurring words on the word frequency list
would only poorly duplicate human index entries for an article, as only
27.63% concur with the entries in Chemical Abstracts. Therefore, Kurmey
came to the conclusion that the subset of highest frequency words used
in the article does not form adequate index entries for the article. Apart
from constructing "maximum-depth" indexes consisting of all different words
occurring in the article except those on an ad hoc "stoplist," Kurmey
could not see any straightforward statistical method of deriving, from a
word frequency model of the text, index entries comparable with the entries
in Chemical Abstracts. Possible improvement in the indexing entries was
suggested by utilizing a thesaurus applicable to the field of chemistry
to select significant words by direct match.

Contrary  to  Kurmey's  results,  comparing  the  lists  of  index  terms 
obtained  in  the  machine  indexing  experiments  by  relative  frequency  with 
the  list  of  terms  prepared  manually,  Levery  (30)  determined  that  on  the 
average  more  than  85  percent  of  the  keywords  chosen  by  the  analysts  were 
also  selected  by  the  machine  method.  The  lists  prepared  manually  were 
arranged in descending order of significance of the keywords and the same
words were obtained by automatic means. The elimination of common words
and the regrouping of synonyms were done by hand.

In  the  related  field  of  book  indexes,  Artandi  (6),  using  a 
section  of  an  inorganic  chemistry  textbook  as  the  experimental  document, 
compared the mechanical index with the average manually produced index
found in inorganic chemistry textbooks. The author claims that the
mechanically produced index compares favorably in intellectual content
with the average published manual index for the same type of material.
Completeness of indexing (which takes into consideration index entries
actually assigned, incorrect entries, and omissions, that is, entries that
should have been assigned but were missed for some reason), a numerical
figure supposed to include both qualitative and quantitative evaluative
criteria, was found to be practically identical for the experimental index
and for the average of the published manual indexes checked. Entry density
(the ratio of the total number of page references to the total number of
pages) was 63.8% higher for the mechanical index than the corresponding
average. Heading density (the ratio of the total number of index entries
to the total number of words in the book) was found to be 8.8% lower than
the corresponding average. The heading densities of the individual published
indexes checked for the study fell in the range of 41.8% below to
56.0% above the average heading density value. It seems, however, that
these figures may be greatly biased because of the predetermined
matching instructions in the indexing procedures and the rather
artificial test conditions.

In  a  modified  experiment  by  Artandi  (4),  two  methods  of  machine 
indexing  proper  nouns  were  tested  on  the  same  inorganic  chemistry  textbook, 
which  contained  a  total  of  148  proper  noun  terms.  Of  the  total  of  324 
entries produced by the machine, 208 entries or 63.1 percent of all the produced
entries were useless or not proper noun entries, viz. noise. Only 87
entries were useful proper noun entries, of which 74 entries or 22.4
percent of all the entries produced by the machine did not need any human
editing.

O'Connor (40, 46) used for his study of the compatibility of
mechanized indexing with human indexing the existing retrieval system at
a pharmaceutical research laboratory (Merck Sharp & Dohme Research
Laboratories, West Point, Pa.). Several dozen documents were examined for each
of the three terms, penicillin, toxicity, and mode of action. The only
approaches  considered  were  those  involving  occurrence  and  frequency  of  the 
term-word  (e.g.  toxicity)  and  synonymous  and  related  words.  For  each 
indexing  term,  efforts  were  made  to  find  a  computer  rule  which  would  assign 
that  particular  term  to  just  those  documents  assigned  that  term  by  the 
human indexers. It was established that, for instance, when "penicillin"
was assigned as a term if the word "penicillin" occurred at least once,
the result was overassignment of up to ten percent of the entire document
collection;  if  the  term  was  assigned  when  the  word  occurred  at  least  twice, 
the  system  would  fail  to  assign  "penicillin"  to  at  least  one  tenth  of  the 
documents  which  should  have  had  it.  No  general  rules  or  conclusions  were 
proposed. 
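
The trade-off between the two occurrence thresholds can be made concrete with a short sketch. It is not O'Connor's procedure: the function name, the document identifiers, and the counts are hypothetical, and the two error rates are assumed to be measured as the text describes them (overassignment against the whole collection, missed assignments against the documents the human indexers tagged).

```python
def assignment_errors(word_counts, human_assigned, threshold):
    """Return (overassignment_rate, miss_rate) for one indexing term.

    word_counts:    document id -> occurrences of the term-word in the document
    human_assigned: ids of documents to which human indexers gave the term
    threshold:      minimum number of occurrences for machine assignment
    """
    human_assigned = set(human_assigned)
    machine_assigned = {doc for doc, n in word_counts.items() if n >= threshold}
    over = machine_assigned - human_assigned    # assigned by machine only
    missed = human_assigned - machine_assigned  # assigned by humans only
    over_rate = len(over) / len(word_counts) if word_counts else 0.0
    miss_rate = len(missed) / len(human_assigned) if human_assigned else 0.0
    return over_rate, miss_rate


# Hypothetical collection of five documents.
counts = {"a": 3, "b": 1, "c": 0, "d": 1, "e": 2}
indexed_by_humans = {"a", "b", "e"}
for threshold in (1, 2):
    print(threshold, assignment_errors(counts, indexed_by_humans, threshold))
```

Raising the threshold from one to two removes the overassigned document but drops one that the human indexers had tagged, which is the pattern reported for "penicillin."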

A  program  to  evaluate  and  compare  the  efficiency  of  machine 
indexing methods with human indexing with regard to the relevancy of documents
retrieved was also reported by Donald J. Hillman (26), but neither
the results of the experiment nor a more detailed statement of the methods
to be used in the evaluation are as yet available.

The  effectiveness  of  machine  indexing  from  titles  has  been 
evaluated  on  1,500  technical  titles,  chiefly  in  the  field  of  physics,  by 
Baxendale  (9).  There  were  two  criteria  for  evaluation.  The  index  term 
had  to  be  constituted  solely  of  a  noun  and  its  adjective  modifiers,  and 
had to be meaningful with respect to the title. Using both of these criteria
for evaluation, approximately 85 percent of the 1,500 titles were
indexed  with  100  percent  accuracy.  That  is,  all  possible  terms  were 
selected  and  all  satisfied  both  evaluation  criteria.  The  effectiveness 
of  the  remaining  15  percent  ranged  between  95  percent  and  40  percent 
accuracy. 


A  similar  project  comparing  the  results  of  automatic  computer 
indexing  of  titles  by  the  KWIC  system  with  human  indexing  using  a  subject 
heading system was reported by Kraft (167). One source of data was 803
legal research projects whose titles were indexed under a modified form of
the Index of Legal Periodicals (ILP) system. The other source of data was
2,625 legal articles classified under the ILP system. Interpretation of
data revealed, among other things, that 64.4% of the title entries contained
as keywords one or more of the ILP subject heading words under which they
were indexed, and 25.1% contained logical equivalents. The remaining 10.5%
of the title entries had nondescriptive titles. The author concluded
that KWIC indexing of legal titles produces an index which costs less
than a subject heading system in both time and cost of production and
which ranks high in "findability."

In their study of automatic subject indexing from textual condensation,
Slamecka and Zunde (52) examined a number of abstracts published
in the Scientific and Technical Aerospace Reports by Documentation Inc.
and compared their contents with the indexing terms which were assigned
by human indexers to these documents. The results of the pilot experiment
showed that, on the average, 80.4 percent of the index terms chosen by
analysts were also contained in the abstract, and that each abstract contained
an additional 10.9 terms which were part of the indexing vocabulary
(Uniterm-type machine term vocabulary). It was also found that a condensation
of approximately 83 percent was necessary in order to obtain significant
indexing terms as the residue of a deletion process.

The  above  described  investigations  compared  machine  indexing  by 
extraction  with  human  indexing.  Some  other  investigations  were  directed 
toward comparison of machine indexing by assignment (or automatic
classification) with corresponding human performance.

In  the  attempt  to  measure  the  reliability  of  subject  classification 
by men and machines as reported by Borko (12), three subject specialists
classified 997 abstracts of psychological reports into one of eleven
categories. These abstracts were also mechanically classified by a
computer program using a factor-score computational procedure. Each
abstract was scored for all categories and assigned to the one with the
highest score. The three manual classifications were compared with each
other and with the mechanical classifications, and a series of contingency
coefficients was computed. The average reliability of manual classification
procedures was equal to .870. The correlation between automatic and
manual classification was .766. It was further concluded that humans
will agree on the classification of approximately 75 percent of the documents,
while automated classification procedures will agree with manual
classification 59 percent of the time. By correcting the
data for attenuation as a result of the known unreliability of the criterion,
it was possible to determine that this percentage of agreement
between automatic classification and perfectly reliable human classification
could be raised to 67 percent.
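
The relation among these figures can be checked with a few lines of arithmetic. The sketch rests on two assumptions that the text above does not state: that each percentage of agreement is the square of the corresponding coefficient, and that the correction for attenuation is the usual one-sided form, dividing the coefficient by the square root of the criterion reliability. The function names are hypothetical.

```python
import math


def agreement_percent(coefficient):
    """Percentage of agreement, assumed here to be the squared coefficient."""
    return 100.0 * coefficient ** 2


def correct_for_attenuation(r_xy, criterion_reliability):
    """Assumed one-sided correction: r_xy / sqrt(reliability of the criterion)."""
    return r_xy / math.sqrt(criterion_reliability)


manual = 0.870      # average reliability of manual classification
automatic = 0.766   # correlation between automatic and manual classification

print(round(agreement_percent(manual), 1))
print(round(agreement_percent(automatic), 1))
print(round(agreement_percent(correct_for_attenuation(automatic, manual)), 1))
```

Under these assumptions the three values come out close to the 75, 59, and 67 percent quoted, which suggests, without confirming, that the percentages were derived in this way.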

Moreover  the  classes  derived  by  factor  analysis  were  compared 
with,  and  shown  to  be  similar  to,  the  existing  subject  classification  system 
employed  by  the  American  Psychological  Association.  According  to  Borko, 
the  study  demonstrates  the  feasibility  of  using  factor  analysis  as  a  method 
for determining the basic dimensions of a classification system.

Another evaluation experiment was carried out by Williams (61).
Of 100 test documents initially selected, 17 were not completely indexed
within  the  experimental  structure.  Therefore,  complete  results  were 
available  on  only  83  of  the  original  100  documents.  Sixty-three  of  these 
documents  were  classified  by  machine  into  only  one  category  at  both  the 
major and minor levels of the predetermined hierarchical classification
system.  Twenty  documents  were  classified  into  one  category  at  the  major 
level  (higher  generic  level),  and  two  categories  at  the  minor  level.  When 
compared  with  the  classification  results  by  human  indexers  of  the  same 
documents,  of  the  first  group  of  documents  78  percent  were  correctly 
classified  at  the  major  level  and  64  percent  correctly  classified  at  the 
minor  levels.  Of  the  second  group  (20  documents),  95  percent  were  correctly 
classified  at  the  major  level  and  60  to  75  percent  at  the  minor  levels. 
According to Williams, two of the major reasons for misclassification were
heterogeneous categories and small sample sizes. Since these results
were obtained on only 15 reference documents per category, it is felt that
improvement could easily be achieved by increasing the number of reference
documents.


In  Stevens's  (55)  study,  the  number  and  type  of  descriptors 
assigned  by  machine  were  compared  with  those  assigned  by  human  indexers, 
both DDC and local. For the documents taken from the teaching sample, the
average "hit" accuracy was 64.8 percent. For new or partially new input
(old items together with new) the "hit" accuracy, or the percentage of
descriptors originally assigned by DDC indexers which were also assigned
by machine, was 48.2 percent. No significant difference in the average
accuracies was obtained as between using titles-and-cited-titles only and
using titles-and-abstracts from the same items. In another evaluation
effort, 25 of the items in the test runs were submitted to one or more
members of the NBS staff, all of whom were users of the collection. They
were asked to choose 12 descriptors for each item exclusively from the list
of descriptors actually available to the machine. The percentage of identical
descriptors thus chosen ranged from 40 percent to 54.2 percent. Thus
the results appear to fall within the range of agreement-data for human
interindexer consistency.

1.3.2.  Comparing  Various  Machine  Indexing  Methods 

As yet, only a limited amount of research has been done to
compare one machine indexing method with another or to determine how various
machine indexing methods perform on the same input and for the same user
requirements. For machine indexing by extraction, Baxendale (7) compares subject
indexes produced by simple deletion of non-significant terms, by selection
by topic sentences and deletion, and by selection by prepositional phrases
and deletion. She arrives at the conclusion that high percentages of
condensation are possible by all of the techniques outlined without untoward
loss of content of an article. No clear advantage of one of these methods
over the others in selection of indexing terms was demonstrated, except
that selection by prepositional phrases enables the system to produce
precoordinated terms, which under certain conditions might be preferable
to Uniterms.

Borko and Bernick (14, 15) made a comparative study of two
methods of indexing by assignment (automatic classification). To test
the hypothesis that the classification system derived by factor analysis
provides a sound basis for document classification and is compatible with
other systems, the same corpus of documents was selected as used by Maron
in his automatic indexing experiment. The following procedural steps
for automatically classifying the documents were used. First, each document,
in machine readable form, was analyzed by the computer. A list of
the index terms and their frequencies of occurrence in each document was
recorded. Second, the category, or categories, containing the index term
was assigned a value equal to the product of the number of occurrences of
the word in the abstract and the normalized factor loading of the word in
the category. If more than one index term appeared in a category, the
products were summed. Thus

$$P = f(L_1 T_1 + L_2 T_2 + \cdots + L_n T_n)$$

where

$P$ = predicted classification, $L_n$ = normalized factor loading
of term $n$ for a given category, and $T_n$ = number of occurrences
of the $n$-th term.


Third, after each index term had been considered, the category
having the highest numerical value was selected as the most probable
subject classification for the document in question. Of the 90 documents
in the validation group which contained two or more cue words, and which
therefore could be automatically classified, 44 documents, or 48.9 percent,
were placed into their correct categories by use of a computer
formula. These results were almost identical to those obtained by Maron
in a previous experiment using the same data but with a different set of
classification categories and a different computational formula. In
classifying the documents in the experimental group Maron's technique was,
however, superior. There the percentage of correctly classified documents
was 84.6% by Maron as against 63.4% by Borko. Obviously, the factor
technique did poorly when operating on the specific body of data from which the
classification system and the factor loadings were derived. A possible
explanation is that the factor analysis method is a generalizing technique
designed to deal with common properties and not with the specific variances
found in a population sample. In contrast, Maron's technique capitalizes
on the specific variance in the sample and, therefore, did far better in
the automatic classification of the documents in the experimental group
than for the validation group. Consequently, for Maron's technique, the
statement that "the more cue words in the document, the better the
automatic indexing" applies. In contrast, a prediction technique based upon
factor loadings appears to have little dependence on the number of cue
words in the article. That is to say, documents containing
one or two cue words were classified with almost the same degree of
accuracy as those containing a larger number. This makes sense when one
realizes that factor analysis is a generalizing technique designed to
minimize the specific variance of the individual words. As a result, a
method of automatic document classification based upon factor loadings
enables one to classify documents containing a minimum of index terms.
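
The three procedural steps described above can be sketched as follows. This is a minimal illustration, not the program used by Borko and Bernick: the function f is not specified in the text and is taken here to be the identity, and the category names, terms, and loadings are invented for the example.

```python
from collections import Counter


def classify(doc_words, loadings):
    """Return (predicted_category, scores) for one document.

    loadings: category -> {index term: normalized factor loading}
    The predicted category is None when no index term occurs in the document.
    """
    # Step 1: record the index terms and their frequencies of occurrence.
    counts = Counter(doc_words)
    # Step 2: for each category, sum loading * occurrences over its terms.
    scores = {
        category: sum(load * counts[term] for term, load in terms.items())
        for category, terms in loadings.items()
    }
    found = any(counts[term] for terms in loadings.values() for term in terms)
    if not found:
        return None, scores
    # Step 3: select the category having the highest numerical value.
    return max(scores, key=scores.get), scores


# Hypothetical loadings for two categories.
loadings = {
    "programming": {"compiler": 0.8, "code": 0.5},
    "hardware": {"circuit": 0.9, "memory": 0.4, "code": 0.1},
}
doc = "the compiler emits code and the code uses memory".split()
category, scores = classify(doc, loadings)
print(category, scores)
```

Because each score is a weighted sum over whatever cue words are present, a single occurrence of a highly loaded term is enough to produce a prediction, which matches the remark that the method depends little on the number of cue words.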

However, since the nature of that study did not provide for an
isolation of the techniques used in automatic classification from the
categories themselves, a new series of tests was conducted. Three
hypotheses were tested: (1) using the original classification
schedule, automatic document classification will be more successfully
performed by means of a Bayesian prediction equation (Maron's method) than
by factor scores; (2) using the modified classification schedule, automatic
document classification will be more successfully performed by means of a
Bayesian prediction equation than by factor scores; and (3) documents will
be correctly classified in the modified classification schedule in a
number significantly greater than in the derived classification scheme
using either the Bayesian or the factor score procedures for automatic
document classification.

It  was  concluded  that  there  was  no  statistically  significant 
difference in the ability of these two procedures to automatically classify
documents. The comparison of the effectiveness of the original and the
modified classification categories for automatic document classification
showed that more documents were correctly classified when using the modified
schedule than when using the original, and that the increase was a statistically
significant one in the most important case, that of predicting the
classification of the previously unexamined documents in the validation group.
Borko and Bernick therefore arrived at the following three conclusions. First,
it  is  possible  to  mathematically  derive  a  set  of  classification  categories 
that  are  descriptive  of  the  major  content  dimensions  of  a  population  of 
documents.  Furthermore,  these  dimensions  are  relatively  stable  as  long  as 
the  parent  population  is  itself  stable  and  unchanging.  Second,  automatic 
document  classification  is  possible  and  may  be  accomplished  by  use  of 
either  Bayesian  or  factor  score  procedures.  Third,  if  automatic  document 
classification  is  to  be  used,  superior  results  will  be  obtained  by  using 
mathematically  derived  classification  categories  based  upon  statistical 
analysis  of  the  words  in  the  documents  and  statistical  indexing  techniques. 

In the opinion of the authors, factor analysis has been demonstrated
to be a useful technique for determining the major dimensions in
an unstructured mass of material. It has been used to derive classification
categories for computer literature and for psychological reports. In
both cases the classification categories were reasonable and reliable.
Factor analysis can be applied to unstructured subject matter such as
newspaper articles, intelligence reports, etc., in an attempt to derive a
reasonable and useful set of classification categories for this type of
material.


1.4.  TIME  AND  COST  ANALYSES 

Little  has  been  reported  on  the  time  studies  of  processing  machine 
generated  indexes  and  even  less  on  the  costs  of  automatically  creating 
indexes. Short references to processing time appear only in the Kraft (28)
and Levery (30) papers. Kraft reports that using an IBM 1401 computer
with  8,000  memory  positions  and  tape  drives,  the  following  processing  times 
were  observed  for  the  auto-indexing  run: 

(a) using an exclusion list of 600 common words: 16 seconds per document.

(b) using an exclusion list of 600 common words and an accept list
of 2,200 words: 60 seconds per document.

The  above  times  include  card  reading,  auto-indexing,  and  tape  writing. 

Levery (30) reports processing time of 15 seconds per document.
It is not specified in the report whether this includes all the time for
matching terms, calculating their absolute and relative frequencies, etc.,
but it may be assumed that it does.

Artandi  presented  cost  estimates  for  mechanical  book  indexing  and 
for mechanical indexing of proper nouns. For book indexing by the method
of matching the text against a vocabulary on the IBM 1620 computer, Artandi
(6) quotes $2.046 per page with an initial investment from $2.71 to $0.014
per page for 5 to 1,000 books over a period of 10 years. The first or
operating costs are broken down into the following items:


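conversion of text for machine input ................ $  .712
machine time and labor for one run ..................    .979
alphabetization, elimination of duplicates,
  crossreferences, material (est.) ..................    .355
                                                      -------
total ............................................... $ 2.046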

For  mechanical  indexing  of  proper  nouns  on  the  IBM  1620  computer, 
Artandi  (4)  quotes  $2.06  per  text  page  for  5  books  and  $1.92  per  text  page 
for  100  books  as  compared  with  $0.04  per  text  page  if  indexing  is  done  by 
conventional methods, viz. by human indexers.

1.5. CONCLUSIONS AND RECOMMENDATIONS

Studies  and  experiments,  which  have  been  done  on  automatic  indexing, 
indicate  that  operational  systems  of  this  type  are  entirely  possible  in 
principle.  No  pioneering  discoveries  are  required  to  have  the  machine  read 
the  document  and  index  it  either  by  extracting  pertinent  terms  from  the  text 
of the document or by assigning terms based on the document analysis. However,
a considerable amount of research is still required in order to have
the machine do it well and efficiently. Thus, the problem is basically that
of optimization: optimizing index file structure and organization, improving
term selection criteria, applying methods of linguistic analysis for class
identification, etc. The problem is also one of cost: under what
circumstances does it pay to have indexing done by machine?

A large amount of statistical analysis is needed to establish
significance criteria for selecting words as indexing terms by the frequency
of their occurrence. The relative-frequency concept of word significance
should be compared with the simple-frequency approach. Word-frequency counts
for specialized subject fields need to be conducted and utilized for
establishing profile parameters. Functions that derive a measure of
significance from the relative frequency of a word should be compared for ease
of interpretation and computation, and for amount of discrimination.
Statistical criteria for selecting words (Uniterms) and precoordinated terms
should be devised, and the merits of these methods should be carefully
evaluated. Studies should also be made of the optimal number of terms per
document, whether the number of terms is related to the number of words in the
document and the total number of documents in store and how they are so
related, and whether it is desirable or possible to predetermine the ratio of
single words to precoordinated terms in the index, if both types of terms
are used.


Additional studies are needed to investigate the effectiveness of
non-statistical measures of significance, such as positional or pragmatic
measures. It should be established whether it is necessary and/or desirable
to delete common words by using a stop list when indexing is done by a
statistical method (frequency count), and it should be determined what the
optimal size of such a stop list should be. The effect on index quality of
matching terms on a predetermined number of characters should be studied as
well.

If the quantitative procedures combined with the simple
non-probabilistic measures of word significance do not produce the desired
refinement of the indexing system, consideration might be given to qualitative
analysis, such as investigating synthetic or linguistic relations.
However, quantitative methods should be preferred over qualitative ones
which require interpretation of the text, because quantitative methods are
much less complicated and, therefore, much less costly and time-consuming.
For indexing by assignment or automatic categorization, the way
to improve the system's efficiency is by conducting more extensive studies
on the relations of words or word combinations and categories of various
classification systems, the degree of the resolution of the classification
system, definitions of class profiles, and word significance coefficients.

The  size  of  the  sample  of  documents  for  the  determination  of  the  total 
population  parameters  and  the  number  of  the  terms  assigned  to  the  documents 
as  well  as  their  generic  relations  should  also  be  analyzed. 

It might be advantageous to combine both the indexing by extraction
and the indexing by assignment methods in one system. This might provide
for better term selection and assignment control possibilities.

In  all  the  systems  proposed  thus  far,  one  element  is  generally 
missing,  the  absence  of  which  hardly  justifies  calling  these  systems 
automatic.  This  missing  element  is  the  feedback  loop,  which  is  essential 
for  any  automatic  system  expected  to  react  to  changing  input  conditions. 
Schematically,  the  present  systems  under  study  can  be  represented  by  the 
following  diagram  (Fig.  1). 


INPUT --> PROCESSOR (Indexer) --> OUTPUT

Fig. 1
For automatically operating systems, there should be at least
one feedback loop from the indexing output (Fig. 2).

[Diagram: INPUT --> PROCESSOR (Indexer) --> OUTPUT, with a feedback loop
from the indexing output]

Fig. 2


A  more  advanced  system  should  have  another  feedback  loop  from 
the  retrieval  output  to  the  input  into  the  system:  (Fig.  3) 


Fig.  3 


The two feedback loops are necessary to adjust indexing parameters according
to the changing quality and quantity of input material and user's
requirements and to control the index file organization for high efficiency of
operation. This is especially important for systems processing large
amounts of data. The effects of file organization on systems efficiency
and other related optimization problems were recently investigated by
Zunde (281).


FORMAL  AUTO  INDEXING  OF  SCIENTIFIC  TEXTS  (FAST) 
FEASIBILITY  AND  SYSTEMS  STUDY 


II.1. CHARACTERISTICS OF A SCIENTIFIC UNITERM INDEX


The automatic indexing system which was to be designed under
this contract had to replace human indexing of scientific abstracts
in projects such as Interagency Life Sciences Supporting Space Research
and Technology Exchange (ILSE) of the Department of Defense or NASA, or
in similar projects where short scientific texts, available in machine
readable form, are to be indexed for retrieval. Samples of such abstracts
are shown in Annex I. It was required that the documents be indexed by the
Uniterm (coordinate) method, as was the case when documents were indexed by
humans. The information was to be stored on magnetic tapes and searches were
to be made by the computer.

Prior to the design of a mechanized substitute for human indexing
of this type of input material, characteristic features and
parameters of a typical index produced by humans were investigated. The
ILSE Index to the store of research abstracts for the year 1963 was
selected as a characteristic sample. The total number of indexed
documents in store was 2,809. The system's vocabulary contained 3,146
indexing terms. Since some of these indexing terms were hyphenated,
the actual number of Uniterms or single words was 3,210. The total
number of postings was 37,471, so that on the average there were 11.91
postings per indexing term and 13.34 postings per document. Since the
research tasks were basically oriented toward life sciences, life science
terminology prevailed to a certain extent in the vocabulary, but
generally the terms were not too specific. Some sample pages of the ILSE
1963 Index vocabulary are reproduced in Annex II.

The population of 3,210 single words (Uniterms) which appear
in the index was analyzed structurally. For each word, the number of
syllables was determined and counted. The results are shown in Table 1.

Table 1. Breakdown of Uniterms by the number of syllables in
the ILSE 1963 Uniterm vocabulary.

No. of       No. of Words        Relative     Total No. of Syllables
Syllables    with i Syllables    Frequency    in the Category
(i)          [T(i)]                           [i T(i)]

   1              429             0.1337            429
   2              758             0.2363          1,516
   3              767             0.2391          2,301
   4              635             0.1979          2,540
   5              357             0.1110          1,780
   6              181             0.0562          1,085
   7               59             0.0184            419
   8               21             0.0065            168
   9                2             0.0006             18
  10                1             0.0003             10

TOTAL           3,210             1.0000         10,266

Furthermore,  for  the  same  population  of  words,  the  number  of 
letters  was  counted  for  each  syllable  and  thus  the  frequency  of  syllables 
of  various  lengths  was  obtained  (see  Table  2). 

The following average values are readily obtained from the
above data:

Average number of syllables per word      $\bar{i}$ = 3.2001
Average number of letters per syllable    $\bar{x}$ = 2.7024
Average number of letters per word        $\bar{t}$ = 8.6480


Table 2. Breakdown of syllables by number of letters for the
Uniterms in the ILSE 1963 index vocabulary.

No. of Letters    No. of Syllables    Relative     Total No. of Letters
Per Syllable      with x Letters      Frequency    in the Category
(x)               [R(x)]              [v(x)]       [x R(x)]

   1                  1,200            0.1169           1,200
   2                  3,411            0.3352           6,822
   3                  3,444            0.3353          10,332
   4                  1,750            0.1704           7,000
   5                    384            0.0374           1,920
   6                     72            0.0070             432
   7                      3            0.0003              21
   8                      2            0.0002              16

TOTAL                10,266            1.0000          27,743

Independently of the above counts by syllables, a character
count was made for each of the 3,146 indexing terms (the hyphen in
hyphenated terms, such as in MAN-MACHINE, was this time counted as a
character). Table 3 shows a breakdown of the indexing terms by number of
characters, and a plot of the corresponding frequency distribution is given
in Figure 1.

From Table 3 we find that the average number of characters per
indexing term is $\bar{t}^{*}$ = 8.899.


Table 3. Breakdown of indexing terms by number of letters
(characters) of the ILSE 1963 vocabulary.

No. of          No. of     Total No. of Characters
Characters      Terms      in the Category
(t)             M(t)       t M(t)

    3             55             165
    4            199             796
    5            246           1,230
    6            299           1,794
    7            350           2,450
    8            373           2,984
    9            379           3,411
   10            329           3,290
   11            285           3,135
   12            194           2,328
   13            151           1,963
   14            102           1,428
   15             71           1,065
   16             46             736
   17             29             493
   18             17             306
   19              9             171
   20              4              80
   21              4              84
   22              3              66
   24              1              24

TOTAL          3,146          27,999
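The totals and the average term length quoted for Table 3 can be rechecked with a few lines of arithmetic. The following is only an illustrative sketch: the dictionary is transcribed from the table, and the function name `length_summary` is a hypothetical one chosen for this example.

```python
# Number of indexing terms M(t) for each term length t in characters,
# transcribed from Table 3 (hyphens counted as characters).
TERMS_BY_LENGTH = {
    3: 55, 4: 199, 5: 246, 6: 299, 7: 350, 8: 373, 9: 379, 10: 329,
    11: 285, 12: 194, 13: 151, 14: 102, 15: 71, 16: 46, 17: 29, 18: 17,
    19: 9, 20: 4, 21: 4, 22: 3, 24: 1,
}


def length_summary(counts):
    """Return (number of terms, total characters, mean characters per term)."""
    n_terms = sum(counts.values())
    n_chars = sum(t * m for t, m in counts.items())
    return n_terms, n_chars, n_chars / n_terms


n_terms, n_chars, mean_len = length_summary(TERMS_BY_LENGTH)
print(f"terms={n_terms}  characters={n_chars}  mean length={mean_len:.4f}")
```

The same total character count gives a first estimate of the storage needed for an unabridged list of terms, one character per storage position.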

Figure 2 gives the distribution of subject word lengths by number of
characters for the Stanford Research Institute (SRI) Uniterm dictionary,
containing 2,082 single-word descriptors,*) and Figure 3 gives the distribution
of the word lengths of the 5,153 most frequent words selected from a sample of
popular magazines.**) The corresponding distribution of terms in the ILSE 1963
vocabulary is shown for the purpose of comparison.

[Figure 2 plot; horizontal axis: number of characters per term; curves: SRI
vocabulary and ILSE vocabulary]

Figure 2 - Distribution of subject word length f(t) in the SRI and ILSE vocabularies.

By comparing the plots in Figure 2 and Figure 3, one can see that
the distributions of subject word lengths in the SRI Uniterm vocabulary and
the ILSE 1963 Uniterm vocabulary are rather similar, whereas the distribution
of word lengths of the most frequent words in a sample of popular magazines is
significantly different from both distributions in the SRI and ILSE
vocabularies.

*) The plot was reproduced from a paper by Ch. P. Bourne and D. F. Ford on
the statistics of letters in English words (84).

**) The data was taken from a paper by E. S. Schwartz (231).

We shall investigate next whether the differences in the distribution
of word lengths in natural language (which is represented by the
sample of popular magazines) and in "indexing language" such as the SRI
and ILSE vocabularies of scientific terms are basically due to different
parameters (average length of terms and variance) of one and the same
distribution function, or whether the differences result from the existence
of entirely different distribution laws for these families of words.
Figure 3 - Distribution of word lengths f(t) of 5,153 most frequent words in
a sample of popular magazines and the distribution of subject word
lengths in the ILSE 1963 vocabulary.
W. Fucks proposed in his paper (119) a mathematical model of the
formation of words out of syllables and of syllables out of letters in natural
language texts. Based on the investigations of the process of the formation
of words out of syllables, he derived the following theoretical probability
distribution function

$$p(i) = e^{-(\bar{i}-1)} \, \frac{(\bar{i}-1)^{\,i-1}}{(i-1)!} \tag{1}$$

where p(i) is the probability of occurrence of words with i syllables,
i = 1, 2, 3, ..., n, and $\bar{i}$ is the average number of syllables per word.
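As a numerical illustration of Eq. (1), read here as a Poisson law displaced by one syllable, the sketch below evaluates p(i) for the ILSE mean of 3.2001 syllables per word. The function name is hypothetical, and the cut-off at ten syllables is an assumption taken from the longest word in the vocabulary.

```python
import math


def fucks_syllable_distribution(mean_syllables, max_syllables):
    """Evaluate p(i) = exp(-(m - 1)) * (m - 1)**(i - 1) / (i - 1)! for i = 1..max."""
    lam = mean_syllables - 1.0  # the mean shifted by the one obligatory syllable
    return {
        i: math.exp(-lam) * lam ** (i - 1) / math.factorial(i - 1)
        for i in range(1, max_syllables + 1)
    }


p = fucks_syllable_distribution(3.2001, 10)
for i, value in p.items():
    print(f"p({i}) = {value:.4f}")
```

The values obtained this way agree with the theoretical column quoted for the ILSE vocabulary to within a few units in the fourth decimal place.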


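For the probability v(x) of a syllable having x letters,
x = 1, 2, 3, ..., m, the following modified equation was obtained

$$v(x) = e^{-\left(\bar{x}-\sum_{\kappa=1}^{l}\varepsilon_\kappa\right)} \sum_{\nu=0}^{x} \left(\varepsilon_\nu-\varepsilon_{\nu+1}\right) \frac{\left(\bar{x}-\sum_{\kappa=1}^{l}\varepsilon_\kappa\right)^{x-\nu}}{(x-\nu)!} \tag{2}$$

where $\bar{x}$ is the average number of letters per syllable and
$\varepsilon_\kappa$, $\kappa$ = 0, 1, 2, 3, ..., l, are special parameters of a
given linguistic structure.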


Now, does the above formula represent a valid law of the fundamental
properties of the word formation process in the "indexing language"
as well, or do the "indexing languages" obey laws of their own?

The relative frequency distribution p(i) of syllables per
indexing term or Uniterm (i = 1, 2, 3, ..., m syllables) for the scientific ILSE
Uniterm vocabulary is shown in Figure 4. In the same illustration, there
have been plotted relative frequency distributions p(i) of words in
Shakespeare's Othello and Huxley's Antic Hay, as well as two curves derived
from Latin texts, i.e., Sallust's Bellum Jugurthinum and Caesar's
De Bello Gallico. Table 4 gives the values of the mean $\bar{i}$, standard
deviation $\sigma$, entropy S, skewness, kurtosis, and third and fourth order
moments $\mu_3$ and $\mu_4$ with regard to the mean value for these five
population samples. Fucks called these values style characteristics, since they
characterize the style of a particular author. We see that the Uniterm
set of indexing terms is different in character from each of the four
other samples.

Figure 4: Relative frequency distribution p(i) of syllables per
word in four different texts and in the ILSE 1963 Uniterm
vocabulary (i = number of syllables = 1, 2, 3, ...).


Table 4: Style characteristics of the ILSE Uniterm Vocabulary
and the two English and two Latin texts from Figure 4.

SOURCE          Mean      Std. dev.   3rd moment   Skewness   4th moment   Kurtosis   Entropy
                          (sigma)     (mu_3)                  (mu_4)                  S

ILSE Uniterm    3.2001    1.5289      2.2464       0.6286     27.4959      4.0327     0.7779
Shakespeare     1.2758    0.5954      0.5040       2.3875      1.1206      8.9149     0.2883
Huxley          1.4087    0.7770      0.7745       1.6510      2.0859      5.7226     0.3804
Sallust         2.5102    1.1059      0.6377       0.4715      4.2977      2.8732     0.6405
Caesar          2.5368    1.2234      0.9097       0.4970      5.8172      2.5971     0.6719
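The style characteristics can be recomputed from a relative frequency distribution. The sketch below does so for the Uniterm frequencies of Table 1, under two assumptions that are not stated in the text: that the entropy uses base-10 logarithms, and that skewness and kurtosis are the third and fourth central moments divided by the third and fourth powers of the standard deviation. Because the input frequencies are rounded to four decimals, the results come out close to, but not identical with, the tabulated values. The function name is hypothetical.

```python
import math

# Relative frequencies p(i) of Uniterms with i = 1..10 syllables (Table 1).
P_UNITERM = [0.1337, 0.2363, 0.2391, 0.1979, 0.1110,
             0.0562, 0.0184, 0.0065, 0.0006, 0.0003]


def style_characteristics(p):
    """Mean, standard deviation, skewness, kurtosis and entropy of p(i)."""
    total = sum(p)
    probs = [value / total for value in p]  # renormalize the rounded input
    sizes = range(1, len(probs) + 1)
    mean = sum(i * q for i, q in zip(sizes, probs))

    def central(order):
        """Central moment of the given order."""
        return sum((i - mean) ** order * q for i, q in zip(sizes, probs))

    sigma = math.sqrt(central(2))
    return {
        "mean": mean,
        "sigma": sigma,
        "skewness": central(3) / sigma ** 3,
        "kurtosis": central(4) / sigma ** 4,
        # Assumption: base-10 logarithm.
        "entropy": -sum(q * math.log10(q) for q in probs if q > 0),
    }


stats = style_characteristics(P_UNITERM)
for name, value in stats.items():
    print(f"{name:9s} {value:.4f}")
```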

A comparison was also made of the values of the distribution function
p(i) for the "indexing language" and the average values for nine languages
derived from many texts of many authors (see Table 5). The latter figures
are taken from the already quoted paper of W. Fucks (119).
Table 5. Relative frequency distributions, mean values and
entropies of the ILSE 1963 Uniterm vocabulary and
of nine languages taken from a representative
average of texts (syllables per word).

             p(1)    p(2)    p(3)    p(4)    p(5)    p(6)    p(7)    p(8)    p(9)    p(10)   Mean    Entropy

UNITERM      0.1337  0.2363  0.2391  0.1979  0.1110  0.0561  0.0184  0.0065  0.0006  0.0003  3.2001  0.7779
ENGLISH      0.7152  0.1940  0.0680  0.0160  0.0056  0.0012                                  1.351   0.367
GERMAN       0.5560  0.3080  0.0938  0.0335  0.0071  0.0014  0.0002  0.0001                  1.634   0.456
ESPERANTO    0.4040  0.3610  0.1770  0.0476  0.0082  0.0011                                  1.859   0.535
ARABIC       0.2270  0.4970  0.2239  0.0506  0.0017                                          2.104   0.513
GREEK        0.3760  0.3210  0.1680  0.0889  0.0346  0.0083  0.0007                          2.105   0.611
JAPANESE     0.3620  0.3440  0.1780  0.0868  0.0232  0.0124  0.0040  0.0004  0.0004          2.137   0.622
RUSSIAN      0.3390  0.3030  0.2140  0.0975  0.0358  0.0101  0.0015  0.0003                  2.228   0.647
LATIN        0.2420  0.3210  0.2870  0.1168  0.0282  0.0055  0.0007  0.0002                  2.392   0.631
TURKISH      0.1880  0.3784  0.2704  0.1208  0.0360  0.0056  0.0004  0.0004                  2.455   0.629


It can be seen from Table 5 that the ILSE "indexing language" is
not similar to or identical with any of the above languages either.
Incidentally, the average number of syllables per term is much closer to the
average in Turkish texts than in English.

With $\bar{i}$ = 3.2001 for ILSE indexing terms, the theoretical or expected
distribution, calculated from Eq. (1), is given in column 1 of Table 6.
Column 2 of that table gives the actual distribution. These distributions
are also plotted in Figure 5.

[Figure 5 plot; vertical axis: p(i); horizontal axis: number of syllables i]

Figure 5: Relative frequency distribution p(i), theoretical and
actual, of the ILSE 1963 vocabulary Uniterms (syllables
per word).

The comparison of the two curves, which represent the actual distribution
and the expected distribution as calculated from Eq. (1), shows that they
agree fairly well and that Eq. (1) reflects at least the main features of the
process of formation of words out of syllables for the "indexing language" as
it does for the natural languages. It remains now to investigate how finer
characteristics of the word formation in the "indexing language" can be derived.


Table 6. Theoretical and actual frequency distribution of indexing
terms in the ILSE 1963 Uniterm vocabulary by number of syllables (i).

            Relative Frequency
p(i)        Theoretical     Actual

p(1)          0.1109        0.1337
p(2)          0.2435        0.2363
p(3)          0.2679        0.2391
p(4)          0.1966        0.1979
p(5)          0.1083        0.1110
p(6)          0.0477        0.0561
p(7)          0.0176        0.0184
p(8)          0.0057        0.0065
p(9)          0.0015        0.0007
p(10)         0.0003        0.0003

To obtain the distribution of the number of letters in syllables,
we shall use Eq. (2). The parameters $\varepsilon_\kappa$ for that equation are
found as follows.


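First we derive the characteristic function for that distribution.
It appears to be:

$$
\begin{aligned}
M(ju) &= \sum_{x=0}^{\infty} v(x)\, e^{jux} \\
      &= e^{-\left(\bar{x}-\sum_{\kappa}\varepsilon_\kappa\right)} \sum_{x=0}^{\infty} \sum_{\nu=0}^{x} \left(\varepsilon_\nu-\varepsilon_{\nu+1}\right) \frac{\left(\bar{x}-\sum_{\kappa}\varepsilon_\kappa\right)^{x-\nu}}{(x-\nu)!}\, e^{jux} \\
      &= \exp\!\left[\left(\bar{x}-\sum_{\kappa}\varepsilon_\kappa\right)\left(e^{ju}-1\right)\right] \sum_{\nu=0}^{\infty} \left(\varepsilon_\nu-\varepsilon_{\nu+1}\right) e^{ju\nu}
\end{aligned}
\tag{3}
$$

From the characteristic function we can derive moments of any order
by noting that:

$$M_n = \lim_{u \to 0} \frac{1}{j^n} \frac{d^n}{du^n} \sum_{x} v(x)\, e^{jux} = \sum_{x} x^n v(x) \tag{4}$$

where $M_n$ is the n-th order moment about the origin.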

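Assuming that $\varepsilon_0 = \varepsilon_1 = 1$, $\varepsilon_4 = \varepsilon_5 = \dots = 0$, and $\varepsilon_2 \neq 0$, $\varepsilon_3 \neq 0$,
we get

$$M_1 = \left(\bar{x}-\sum_{\kappa}\varepsilon_\kappa\right) + 1 + \varepsilon_2 + \varepsilon_3 = \bar{x} \tag{5}$$

$$M_2 = \bar{x}^2 + \bar{x} - 1 - (\varepsilon_2+\varepsilon_3)^2 + 2\varepsilon_3 \tag{6}$$

$$M_3 = \bar{x}^3 + 3\bar{x}^2 - 2\bar{x} - 3\bar{x}(\varepsilon_2+\varepsilon_3)^2 + 6\bar{x}\varepsilon_3 - 3(1+\varepsilon_2+\varepsilon_3)^2 + 2(1+\varepsilon_2+\varepsilon_3)^3 - 6(\varepsilon_2+\varepsilon_3)(\varepsilon_2+2\varepsilon_3) + 6\varepsilon_3 \tag{7}$$

With regard to the mean, the corresponding moments are:

$$m_1 = 0 \tag{8}$$

$$m_2 = \bar{x} - 1 - (\varepsilon_2+\varepsilon_3)^2 + 2\varepsilon_3 \tag{9}$$

$$m_3 = \bar{x} - 3(1+\varepsilon_2+\varepsilon_3)^2 + 2(1+\varepsilon_2+\varepsilon_3)^3 - 6(\varepsilon_2+\varepsilon_3)(\varepsilon_2+2\varepsilon_3) + 6\varepsilon_3 \tag{10}$$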


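From our sample population we have:

$$\bar{x} = M_1 = 2.7024$$

$$m_2 = M_2 - M_1^2 = 1.1167$$

$$m_3 = M_3 - 3M_1M_2 + 2M_1^3 = 0.3717$$

With

$$\varepsilon_2 = -\varepsilon_3 \pm \sqrt{\bar{x} - 1 + 2\varepsilon_3 - m_2} = -\varepsilon_3 \pm \sqrt{2\varepsilon_3 + 0.5857}$$

from Eq. (9), we obtain by substitution in Eq. (10)

$$8\varepsilon_3^3 - 7.0284\,\varepsilon_3^2 + 0.6219 = 0$$

Hence

$$\varepsilon_3 = 0.405$$

and

$$\varepsilon_2 = 0.777$$

Substituting these values into Eq. (2), we get the following distribution
function of letters in syllables for Uniterm indexing terms:

$$v(x) = 0.5943 \left[ 0.223\,\frac{0.5204^{\,x-1}}{(x-1)!} + 0.372\,\frac{0.5204^{\,x-2}}{(x-2)!} + 0.405\,\frac{0.5204^{\,x-3}}{(x-3)!} \right]$$

where a term is omitted whenever the argument of its factorial is negative.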
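The parameter fit can be retraced numerically. The sketch below is one possible way of doing so, not the procedure used in the report: it eliminates $\varepsilon_2$ by means of the second central moment, solves the third-moment condition for $\varepsilon_3$ by bisection, and then evaluates v(x) with the rounded parameters 0.777 and 0.405. The bracket [0.3, 0.5] is an assumption chosen to isolate the root reported in the text, and all function names are hypothetical.

```python
import math

X_MEAN, M2, M3 = 2.7024, 1.1167, 0.3717  # sample moments quoted in the text


def eps2_from_eps3(eps3):
    """Second central moment solved for eps2 (positive square root)."""
    return math.sqrt(X_MEAN - 1.0 + 2.0 * eps3 - M2) - eps3


def m3_residual(eps3):
    """Model third central moment minus the sample value, eps2 eliminated."""
    eps2 = eps2_from_eps3(eps3)
    s = eps2 + eps3
    model = (X_MEAN - 3 * (1 + s) ** 2 + 2 * (1 + s) ** 3
             - 6 * s * (eps2 + 2 * eps3) + 6 * eps3)
    return model - M3


def bisect(f, lo, hi, tol=1e-10):
    """Plain bisection; f(lo) and f(hi) must have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == (f_lo > 0):
            lo, f_lo = mid, f(mid)
        else:
            hi = mid
    return 0.5 * (lo + hi)


def v(x, eps, x_mean=X_MEAN):
    """Letters-per-syllable law for eps = [eps0, eps1, ...]; later eps are 0."""
    lam = x_mean - sum(eps[1:])
    padded = list(eps) + [0.0]
    total = 0.0
    for nu in range(0, min(x, len(eps) - 1) + 1):
        weight = padded[nu] - padded[nu + 1]
        total += weight * lam ** (x - nu) / math.factorial(x - nu)
    return math.exp(-lam) * total


eps3 = bisect(m3_residual, 0.3, 0.5)  # assumed bracket around the quoted root
eps2 = eps2_from_eps3(eps3)
print(f"eps2 = {eps2:.4f}, eps3 = {eps3:.4f}")

EPS_ROUNDED = [1.0, 1.0, 0.777, 0.405]  # rounded values as given in the text
for x in range(1, 9):
    print(f"v({x}) = {v(x, EPS_ROUNDED):.4f}")
```

The bisection result differs from the rounded parameters only in the third decimal place, which is why the distribution is evaluated with the rounded figures from the text.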
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Table  7  gives  theoretical  distribution  and  actual  distribution 
of  letters  per  syllable  and  Figure  6  shows  the  plot  of  these  two  curves. 


There again is a satisfactory correspondence between theoretical and
actual values, and better fits could be obtained by introducing additional
parameters $\varepsilon_\kappa$, calculated from higher order moments.


Table 7: Theoretical and actual frequency distribution of the
number of letters in syllables in the ILSE 1963
Uniterm vocabulary.

No. of Letters         Relative Frequency
Per Syllable (x)       Theoretical     Actual

       1                 0.1325        0.1169
       2                 0.3590        0.3352
       3                 0.3745        0.3355
       4                 0.1583        0.1705
       5                 0.0381        0.0374
       6                 0.0064        0.0070
       7                 0.0012        0.0003
       8                 0.0001        0.0002

[Figure 6 plot; horizontal axis: number of letters per syllable, 1 to 8]

Figure 6: Relative frequency distribution, theoretical and actual,
of syllables by the number of letters in the ILSE 1963
Uniterm vocabulary.

Thus we can conclude that certain probabilistic laws do govern
the formation of indexing terms from syllables and letters. If the
"style characteristics," viz. moments of various orders of the distribution
of terms by syllables and letters, are known or can be obtained from
a representative sample, it should be possible to theoretically calculate
the most probable distribution of the terms for populations of any type
and size with satisfactory accuracy.
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This  is  of  great  practical  value  in  calculating  required 
memory  space  to  store  lists  of  terms  in  computer  memory,  designing  match 
procedures,  deriving  the  number  of  significant  characters  for  terms  on 
the  authority  lists,  and  optimizing  systems  performance.  Applications 
of  this  kind  were  made  in  designing  the  Formal  Autoindexing  of 
Scientific  Texts  (FAST)  System  described  in  the  following  chapters. 
II.3. FORMAL AUTO INDEXING OF SCIENTIFIC TEXTS (FAST) SYSTEM


It  is  assumed  that  the  input  into  the  system,  which  consists  of  short 
scientific  abstracts,  is  available  in  computer  readable  form.  At  this  stage  of 
development,  this  means  that  the  text  is  available  on  magnetic  tapes,  although 
the  method  of  reading  the  abstracts  by  the  machine  is  immaterial  to  the  FAST 
program.  For  instance,  magnetic  tapes  could  be  replaced  by  optical  scanning 
devices,  in  which  case  the  abstracts  would  be  read  from  printed  copies. 

There  are  no  particular  requirements  for  the  conversion  of  the 
texts  of  the  abstracts  to  machine  readable  form  except  that  the  words  should 
not  be  broken  apart  at  the  end  of  the  line  for  the  purpose  of  carrying  them 
over.  However,  it  is  possible  that  certain  requirements  might  be  originated  by 
the  user  as  part  of  the  overall  systems  specifications,  for  instance,  fixed 
positions  for  certain  proper  names,  spelling  of  chemical  compounds,  etc. 

The indexing terms are extracted from the abstracts as the computer
scans the text word by word. Blank spaces indicate to the computer the beginning
and the end of a word. The essential parts of the FAST system are: a
programmed mechanism for eliminating words which under no circumstances can
be considered potential indexing terms (Kill List Program), a programmed
mechanism for selecting, editing and cumulating significant terms (Authority
File Program), and a programmed mechanism for implementing human control and
optimization capability in unresolved cases (Residue Editing Program).
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The  mechanism  for  eliminating  words  which  are  unacceptable  indexing 
terms  consists  of  a  set  of  computer  instructions  to  delete  terms  which  match 
with  the  terms  on  the  Kill  List  especially  designed  for  this  system.  The  match 
has  to  be  complete  on  all  characters  for  the  computer  to  delete  the  word. 

Every  word  which  is  put  into  the  system,  be  it  title,  body  or  footnote 
of  the  abstract,  is  matched  against  the  Kill  List,  but  there  is  one  exception  to 
the  deletion  instruction.  Certain  words,  though  they  appear  on  the  Kill  List, 
are  not  deleted  if  they  are  part  of  the  title.  Therefore,  before  deleting 
the  words  in  the  titles,  the  computer  compares  them  with  a  Title  Exemption  File. 
If a word appears in that file, it is not deleted as would be the case if it
were found in the body of the abstract, but is flagged and retained as an
indexing term.


The reason for this is that there is a category of words which under
most circumstances would be undesirable indexing terms, but in certain cases
might become acceptable. Consider words such as ATTENTION, DURATION, OPINION,
WORK, etc. In sentences like "The investigator paid much attention to the
proper selection of test animals" or "The work progressed satisfactorily," the
words attention and work would not be significant enough to justify their
selection as indexing terms. But let us now take the sentences "Investigation
of the factors influencing the attention of astronaut under severe flight
conditions" or "Measuring the efficiency of work of primates." There the same
words attention and work are significant indicators of the content of the
documents to be indexed. It has been established that usually words of this
type become significant indexing terms if the processes or objects they
designate are subjects of a study or investigation. In such cases there is a
high probability that these words will appear in the titles or headings of
the abstracts describing such scientific tasks or projects. The function
of the above described exemption mechanism for titles is to detect such words
and convert them to indexing terms.
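The deletion rule with its title exemption can be sketched as follows. The two word sets are tiny illustrative samples, not the actual Kill List or Title Exemption File, and the function name and the stripping of punctuation are assumptions made for this example only.

```python
# Illustrative samples only; the real lists are far larger.
KILL_LIST = {"THE", "OF", "TO", "MUCH", "PAID", "ATTENTION", "WORK"}
TITLE_EXEMPTIONS = {"ATTENTION", "DURATION", "OPINION", "WORK"}


def filter_words(text, in_title):
    """Delete Kill List words; in titles, keep and flag exempted words.

    Returns a list of (word, flagged) pairs. Flagged words are retained as
    indexing terms; unflagged words go on to Authority List matching.
    """
    kept = []
    for raw in text.split():                      # blanks delimit words
        word = raw.strip(".,;:\"'()").upper()     # assumed punctuation handling
        if not word:
            continue
        if word in KILL_LIST:                     # complete match on all characters
            if in_title and word in TITLE_EXEMPTIONS:
                kept.append((word, True))
            continue
        kept.append((word, False))
    return kept


title = filter_words("Measuring the efficiency of work of primates", in_title=True)
body = filter_words("The work progressed satisfactorily.", in_title=False)
print(title)
print(body)
```

Here WORK survives only in the title, where it is flagged, while the same word in the body of the abstract is deleted.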

From what remains after deletion of insignificant words, the computer
selects, edits, and cumulates significant indexing terms (Authority File
Program). The basic element of this mechanism is a file of terms considered
to be acceptable indexing terms for the particular type of input. This file
is called the Authority List. For different subject fields of input, Authority
Lists might be different.

Words in the abstract are matched against the terms on the Authority
List. However, this time a complete match is not required on all characters but
only on certain significant characters which are specifically identified for
each term on the Authority List (see Annex III). The longest match, if
there is a match at all, of a given word from the text on the significant
characters of a term in the Authority List is considered a "hit." If there is
a "hit," the term on the Authority List is accepted and printed as the indexing
term.

To illustrate the procedure, consider the word CONDITIONAL. The
Authority List might contain terms (asterisk indicates the end of significant
characters) such as

CONDITION*      (9 significant characters)

CONDITIONE*D    (10 significant characters)

CONDITIONI*NG   (10 significant characters)

The FAST program will start matching the word CONDITIONAL against
CONDITIONING and then against CONDITIONED, since these two have the greatest
number of significant characters in the batch of terms against which the word
CONDITIONAL is matched. Since the word does not match with either of these
terms on 10 significant characters, it is next matched against the term CONDITION,
which requires matching on 9 significant characters. The word CONDITIONAL
does match on the first nine characters of the term CONDITION on the Authority
List, and therefore the term CONDITION (but not the word CONDITIONAL) is assigned
to the corresponding abstract as an indexing term.
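The longest-match rule can be written down compactly. The following sketch assumes that Authority List entries are held as strings with an asterisk after the last significant character, as in the example above; the function names are hypothetical and the sketch ignores how the file is partitioned for searching.

```python
AUTHORITY_LIST = ["CONDITION*", "CONDITIONE*D", "CONDITIONI*NG"]


def parse_entry(entry):
    """Split 'CONDITIONE*D' into (full term, significant characters)."""
    significant, _, rest = entry.partition("*")
    return significant + rest, significant


def assign_term(word, authority_list):
    """Return the term with the longest matching significant part, or None."""
    word = word.upper()
    entries = sorted(
        (parse_entry(entry) for entry in authority_list),
        key=lambda pair: len(pair[1]),
        reverse=True,                  # try the longest significant part first
    )
    for term, significant in entries:
        if word.startswith(significant):
            return term                # first hit is the longest match
    return None


for sample in ("CONDITIONAL", "CONDITIONED", "CONDITIONS", "MAGNET"):
    print(sample, "->", assign_term(sample, AUTHORITY_LIST))
```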

The subsets of terms of the Authority List, against which a word
from the abstract being processed is matched, are obtained by sorting the terms
of the Authority List on the first three characters. Within the subsets, the
words are sorted by the number of significant characters in increasing order and
alphabetically within the sub-subsets of terms with the same number of
significant characters.

The subsets can also be formed by sorting the terms of the Authority
List on only the first two characters instead of the first three, if the file
is not too large. Three characters are the upper limit for this purpose, since
this is the lowest number of significant characters a term might be designed to
have (there are no terms on the Authority List with two or one significant
characters).

There  is  also  a  rule  relating  the  number  of  significant  characters 
against  which  a  word  is  matched  and  the  total  number  of  characters  in  that  word. 
This  rule  says  that  if  the  word  is  five  characters  long  or  less,  it  is  matched 
only  against  those  terms  on  the  Authority  List  which  have  five  significant 
characters  or  less.  If  the  word  is  six  characters  long  or  longer,  it  is  matched 
only  against  such  terms  of  the  Authority  List  which  have  five  significant 
characters  or  more.  Thus,  the  word 

BATTERIES 

in  the  text  of  an  abstract  would  be  matched  against  the  Authority  List  term 

BATTER*Y 

on  six  significant  characters  and  indexed  by  this  term,  but  it  would  not  be 
matched  against  the  Authority  List  term 

BAT* 

on three significant characters, even if the Authority List did not contain
BATTER*Y. Similarly, this rule would prevent the word DISCONTINUITY from being
accepted by the Authority List term DISC*, PUMPERNICKEL by PUMP*, etc. This
rule had to be applied because, with a decreasing number of significant
characters, the discriminating power of the Authority List terms with regard to
longer words decreases very significantly.
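A minimal sketch of the length rule, combined with the longest-match search, is given below. It assumes the same asterisk notation for Authority List entries; the threshold of five characters is taken from the rule as stated, while the function names are hypothetical.

```python
def parse_entry(entry):
    """Split 'BATTER*Y' into (full term, significant characters)."""
    significant, _, rest = entry.partition("*")
    return significant + rest, significant


def eligible(word, n_significant, threshold=5):
    """Length rule: short words meet short terms, long words meet long ones."""
    if len(word) <= threshold:
        return n_significant <= threshold
    return n_significant >= threshold


def assign_term(word, authority_list):
    """Longest match on significant characters among the eligible terms."""
    word = word.upper()
    candidates = [parse_entry(entry) for entry in authority_list]
    candidates = [(t, s) for t, s in candidates if eligible(word, len(s))]
    candidates.sort(key=lambda pair: len(pair[1]), reverse=True)
    for term, significant in candidates:
        if word.startswith(significant):
            return term
    return None


print(assign_term("BATTERIES", ["BAT*", "BATTER*Y"]))
print(assign_term("BATTERIES", ["BAT*"]))
print(assign_term("BATS", ["BAT*"]))
```

Without the eligibility test, BATTERIES would be posted under BAT whenever BATTER*Y is missing from the list, which is exactly the loss of discrimination the rule is meant to prevent.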


The above described mechanism for the selection of significant terms
performs at the same time the important function of editing the index by combining
similar terms which agree on the significant characters. Thus, as
a result of the editing procedure, the abstracts containing the words

DIFFRACTION 

DIFFRACTIONS 

DIFFRACTED 

DIFFRACTIVE 

DIFFRACTS 

would  be  posted  under  the  index  term  DIFFRACTION  and  the  abstracts  containing 
the  words 

INHOMOGENEITY

INHOMOGENEITIES

INHOMOGENEOUS

INHOMOGENEOUSLY

would be posted under the index term INHOMOGENEITY. To give one more example,
abstracts containing

MAGNETIC 

MAGNETICALLY 

MAGNETIZE 

MAGNETIZATION 

MAGNET 

MAGNETS

MAGNETO


MAGNETISM 

would  be  posted  under  the  index  term  MAGNETISM. 

The programmed mechanism to provide human control and systems optimization
capability (Residue Editing Program) consists of three subroutines:

a.  Subroutine  for  the  generation  of  the  Residue  Record  with  the 
frequency  count  of  terms. 

b.  Subroutine  for  updating  Kill  List. 

c.  Subroutine  for  updating  Authority  List. 

The subroutine for the generation of Residue Records produces a listing
of words which match neither the Kill List nor the Authority List.
Furthermore, it counts the frequency of occurrence of such words and lists them
in decreasing order of occurrence. Basically, three categories of words can
appear on the Residue Record: words which are not acceptable as indexing
terms but which were not included in the Kill List; words which should have
generated indexing terms but did not do so because there were no matching terms in
the Authority List; and words which did not match an existing term on the
Kill or Authority List because of spelling errors.

The Residue Record is periodically reviewed by a human editor. In
addition to correcting misspelled words, the human editor updates the index by
adding the indexing terms derived from the significant terms, and optimizes the
system by using the feedback to update the Kill List and the Authority List.
In both cases, the frequency of appearance of the candidate terms for the Kill
and Authority List in the Residue Record serves as a criterion for creating new
kill and authority terms. Thus, if an insignificant word appears reasonably often
in the texts, it would be placed on the Kill List, but if such a word appears
seldom, it might not be economically justifiable to create a new term for the Kill
List, since this increases processing time. Similar considerations apply to the
updating of the Authority List.
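
The generation of the Residue Record described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original subroutine: the Kill List is taken to be a set of words, the Authority List check is passed in as a hypothetical predicate `is_authorized`, and ties in frequency are broken alphabetically.

```python
from collections import Counter


def residue_record(words, kill_list, is_authorized):
    """Count the words that neither list accounts for, most frequent first."""
    residue = Counter(
        w for w in words if w not in kill_list and not is_authorized(w)
    )
    # Decreasing frequency; ties broken alphabetically for a stable listing.
    return sorted(residue.items(), key=lambda item: (-item[1], item[0]))


words = ["THE", "PLASMA", "OF", "LASER", "LASER", "PLASMA", "LASER", "MAGNETIC"]
kill_list = {"THE", "OF"}
record = residue_record(words, kill_list, lambda w: w.startswith("MAGNET"))
print(record)  # [('LASER', 3), ('PLASMA', 2)]
```

The human editor would then read the listing from the top, since the most frequent residue words are the most economical candidates for either list.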

As  a  final  product,  the  system  delivers: 

a. Subject Index to the documents in store. This is the index file
sorted by indexing terms and, when printed, it gives indexing
terms in alphabetical sequence with the accession numbers of
documents to which these terms were assigned. The subject
index can be produced with or without cross-references,
depending on the user's requirements (see Annex X for a sample page).

b. Sets of indexing terms assigned to single documents. This is
the index file sorted by accession numbers. The sets of indexing
terms, or, as they are often referred to in this contract,
the sets of key words, would usually be printed with the
abstracts, if such a print-out is required at all (see Annex XI
for a sample).
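
The two deliverables are the same index file in two sort orders. A minimal sketch, assuming the index file can be represented as (accession number, indexing term) pairs; the function name and the sample data are hypothetical:

```python
from collections import defaultdict


def build_index_files(postings):
    """postings: iterable of (accession_number, indexing_term) pairs."""
    by_term = defaultdict(set)
    by_document = defaultdict(set)
    for accession, term in postings:
        by_term[term].add(accession)
        by_document[accession].add(term)
    # Subject index: terms in alphabetical sequence with their accession numbers.
    subject_index = [(t, sorted(by_term[t])) for t in sorted(by_term)]
    # Sets of indexing terms per document, in accession-number order.
    term_sets = [(a, sorted(by_document[a])) for a in sorted(by_document)]
    return subject_index, term_sets


postings = [(102, "PLASMA"), (101, "MAGNETISM"), (102, "MAGNETISM")]
subject_index, term_sets = build_index_files(postings)
print(subject_index)  # [('MAGNETISM', [101, 102]), ('PLASMA', [102])]
print(term_sets)      # [(101, ['MAGNETISM']), (102, ['MAGNETISM', 'PLASMA'])]
```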

Figures 7 through 11 show the flow charts of the system and its
components.

[Figure 7: Formal Auto-indexing of Scientific Texts (FAST) System (flow chart)]

[Figure 10 (flow chart)]

[Figure 11 (flow chart)]

11.4. CHARACTERISTICS OF THE INPUT INTO THE FAST SYSTEM

It  was  already  mentioned  that  the  FAST  system  was  designed  to 
process  abstracts  of  scientific  documents  or  short  descriptions  of 
research  endeavors  written  in  concise  scientific  language.  This  means 
that  the  system  was  optimized  for  this  particular  type  of  input.  The 
length  of  a  single  item  (abstract  or  task  description)  was  to  be 
approximately  200  words  (see  Samples  in  Annex  I). 

Three  random  samples  were  drawn  from  the  total  population  of 
USE  and  OAR  abstracts  in  store  for  a  more  detailed  investigation  of 
the  characteristics  of  the  input.  The  first  sample  contained  142 
abstracts,  the  second  and  third  contained  30  abstracts  each.  The 
average length of the documents used in actual tests of the system
is given in Table 8.


Table 8: Number of documents and number of words in documents
used in testing FAST system.

Sample   No. of      Min. No. of Words   Max. No. of Words   Average No. of Words
No.      Documents   in a Document       in a Document       per Document

1        142         10                  260                  91.7
2         30         33                  233                 114.4
3         30         58                  272                 139.8

The total number of word occurrences in the documents of
Sample No. 1 was 12,792. The number of different words in this population
of word tokens*) was 2,841, so that on the average the same word
occurred 4.503 times. Figures in Table 9 relate the number of different
word types to the number of their occurrences in this document group.

A  corresponding  plot  of  the  number  of  word  types  versus  the  number 
of  their  occurrences  is  shown  in  Figure  12  on  a  logarithmic  scale. 
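
The token and type counts used in this section can be reproduced with a short sketch. It assumes the text has already been split into upper-case words; the function name is hypothetical and the tokenization used in the original study is not specified here.

```python
from collections import Counter


def type_token_stats(tokens):
    """Return (tokens, types, average occurrences per type, types by occurrence count)."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # How many word types occur exactly k times (the kind of data Table 9 relates).
    types_by_occurrences = dict(Counter(counts.values()))
    return n_tokens, n_types, n_tokens / n_types, types_by_occurrences


print(type_token_stats("A B A C A B".split()))  # (6, 3, 2.0, {3: 1, 2: 1, 1: 1})

# Figures quoted for Sample No. 1: 12,792 tokens and 2,841 types.
print(round(12792 / 2841, 3))  # 4.503
```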

The  rank-frequency  order  of  the  20  most  frequent  words  in  the 
above  sample  was  as  shown  in  Table  10. 

E. S. Schwartz (231) reported the 60 most frequent word types obtained
after processing 10,000 and 19,710 word tokens from 7 popular magazine articles.
The first 20 words from his list are reproduced in Table 11.

It is noted that only rank 4 of the list in Table 10 and of
the 10,000 token list of Table 11 is identical, as are ranks 4 and 14 for the
19,710 token list. Ranks 1 through 10 of the words of the Sample No. 1
ILSE documents appear as ranks 2-1-5-4-6-3-30-(WILL is not included in the first
60 ranks)-11-26 on the 10,000 token list**) of Table 11 and as ranks 2-1-5-4-
6-3-15-(WILL is again not included in the first 60 ranks)-11-27 on the 19,710
token list. The rank correlation between the top ten words of Table 10 (WILL
is substituted by the next word) and each of the two lists is -4.106 and -1.292
respectively. The ten top words comprise 28.9 percent of word occurrences in the
Sample No. 1 documents, whereas they comprise only 23 percent of the word
occurrences in the 19,710 word sample. The 20 top words comprise 34.7 and 30
percent of word occurrences respectively.

*) The term "word tokens" is used here in the sense of each word occurrence
in the text, some of which are exactly alike in their character structure,
whereas "word types" is the subset of word tokens each one identifiable
by a different character structure.

**) Schwartz gives in his paper (231) the first 60 ranks of word
types in order of their frequency.

Table 9. Number of occurrences of word types in the population of 12,792
text words (word tokens) of sample No. 1 ILSE documents.

Figure 12: A plot of word frequencies versus word types for the
sample No. 1 ILSE documents.

Table 10. Rank-frequency order of the words in the sample No. 1 ILSE documents

Rank   Word Type   Frequency

 1     OF          808
 2     THE         756
 3     AND         573
 4     TO          394
 5     IN          306
 6     A           193
 7     BE          183
 8     WILL        182
 9     FOR         174
10     THIS        128
11     IS          117
12     ON          100
13     AS           88
14     ARE          69
15     WITH         67
16     RESEARCH     65
17     STUDY        63
18     STUDIES      61
19     WHICH        60
20     BY           59


Table 11. Rank-frequency order of word occurrences in 7 magazine articles

        10,000 Tokens            19,710 Tokens
Rank    Word Type   Frequency    Word Type   Frequency

 1      THE         657          THE         1192
 2      OF          323          OF           677
 3      A           274          A            541
 4      TO          247          TO           518
 5      AND         234          AND          462
 6      IN          196          IN           450
 7      THAT        109          THAT         242
 8      IT          105          HE           105
 9      HE           97          IS           190
10      IS           97          IT           181
11      FOR          79          FOR          157
12      WE           79          HIS          138
13      ON           75          ON           134
14      I            73          ARE          124
15      HIS          69          BE           123
16      WAS          64          WITH         121
17      THEY         62          I            112
18      YOU          62          HAVE         111
19      WITH         61          WAS          111
20      AS           59          YOU          106

Finally, the relation between the total number of word occurrences
and the number of different words (word types) in ILSE documents was investigated
and compared with available data on other texts. The data for this comparison
were again taken from the above referenced paper of Schwartz (231). The results
are summarized in Table 12.


Table 12. Word counts by tokens and types

                                                       Words                   Percentage
Study             Date   Material                      Tokens       Types      of Types

Eldridge          1911   Newspaper articles            43,989       6,002      13.6
Dewey             1923   Miscellaneous                 100,000      10,161     10.2
Hanley            1937   Joyce's "Ulysses"             260,430      29,899     11.5
Thorndike         1944   Miscellaneous                 18,000,000   30,000     -
Miller-Newman     1958   Miscellaneous                 36,299       5,537      15.2
Armour Research   1960   Military exercise             38,992       2,081      5.3
ILSE Documents    1965   Scientific task descriptions  12,792       2,841      22.2

The following conclusions can be derived from the above investigations:

1. By deleting duplicates, the population of words (word tokens) in
ILSE type documents can be condensed to approximately 22 percent
of its original volume.

2. The degree of condensation thus achieved is less than for
non-scientific texts or for texts not in abstract form.

3. The lists of most frequent words in scientific abstracts and in
articles from popular magazines differ considerably, both in the
rank order of identical word types and in the word types themselves.

11.5. DESIGN AND TESTING OF SYSTEMS COMPONENTS


A. Kill List. The Kill List was designed to eliminate from the input
data terms which:

(1)  do  not  carry  any  information  at  all,  such  as  words:  of, 
the,  but,  are,  have,  etc. 

(2) cannot be considered acceptable indexing terms because they
possess little discriminatory power in the specific environment
of their occurrence. For the particular type of documents
processed, this category includes such terms as: RESEARCH,
STUDY, TASK, etc.


On the other hand, a word might belong to one of the categories described
above and yet not be placed on the Kill List because it does not appear
often enough in the text to make such an inclusion desirable or economically
justifiable. For one thing, certain limits as to the practical size of the
list are set by the computer's memory capacity. Furthermore, checking whether
a term on the Kill List appears in the text requires a certain amount of
computer time, and if the probability of such occurrences is low, it might be
worthwhile to let the word appear on the residue list of words which match
neither the Kill List nor the Authority List. In other words, the
final criterion for the inclusion of a term in the Kill List is a trade-off
decision which weighs the economics of computer processing
time against the economics of human editing of the no-match residue of the
input data (Residue Record).

Samples No. 2 and No. 3, of 30 abstracts each, were processed
against the Kill List containing 1,162 terms. This Kill List was derived
from Sample No. 1 of 142 abstracts. The results of the condensation of
text thus achieved are shown in Table 13.

Table 13. Data on the processing of Samples No. 2 and No. 3 against
the Kill List of 1,162 terms

               Number of Word     No. of Word Tokens     Percentage
               Tokens Processed   Eliminated by the      of Reduction
                                  Kill List

Sample No. 2   3,434              2,182                  63.5
Sample No. 3   4,194              2,332                  55.6

The first seventy-three most frequent words eliminated from the
word population of Sample No. 2 by processing against the Kill List are
listed in Table 14.

The words in Table 14 account for 46.1 percent of all word
occurrences in the documents of Sample 2. Thus the remaining 1,089 terms on
the Kill List produced an additional reduction of the original volume of
words of 17.4 percent only.
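
The Kill List pass and the reduction figures of Table 13 can be illustrated with a minimal sketch. The function names are hypothetical, the Kill List is assumed to be a set of upper-case words, and matching is assumed to be on whole words.

```python
def apply_kill_list(tokens, kill_list):
    """Drop every token found on the Kill List; return (kept, eliminated, percent)."""
    kept = [t for t in tokens if t.upper() not in kill_list]
    eliminated = len(tokens) - len(kept)
    percent = 100.0 * eliminated / len(tokens) if tokens else 0.0
    return kept, eliminated, percent


def reduction_percent(processed, eliminated):
    """Percentage of word tokens removed, rounded as in Table 13."""
    return round(100.0 * eliminated / processed, 1)


kept, eliminated, percent = apply_kill_list(
    ["the", "study", "of", "plasma"], {"THE", "OF", "STUDY"}
)
print(kept, eliminated, percent)      # ['plasma'] 3 75.0

print(reduction_percent(3434, 2182))  # 63.5 (Sample No. 2)
print(reduction_percent(4194, 2332))  # 55.6 (Sample No. 3)
```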

Table 14. First seventy-three words deleted from the Sample No. 2 by
processing against the Kill List, in their order of frequency.

Word             No. of         % of Total
                 Occurrences    No. of Deletes

THE              242            11.0
OF               235            10.7
AND              130             5.9
TO               115             5.2
IN               107             4.9
A                 46             2.1
IS                43             1.9
BE                36             1.6
THIS              35             1.6
ON                31             1.4
[illegible]       29             1.3
WILL              29             1.3
STUDIED           26             1.1
BEEN              21             0.9
FOR               21             0.9
ARE               20             0.9
BY                19             0.8
RESEARCH          16             0.7
AS                15             0.6
DETERMINE         15             0.6
WHICH             14             0.6
FROM              13             0.5
OR                13             0.5
HAS               11             0.5
HAVE              11             0.5
AN                10             0.4
HUMAN             10             0.4
SYSTEMS           10             0.4
TASK              10             0.4
THAT              10             0.4
BEING              9             0.4
DURING             9             0.4
VARIOUS            9             0.4
AT                 8             0.3
EFFECTS            8             0.3
MADE               8             0.3
PURPOSE            8             0.3
UNDER              8             0.3
HIGH               7             0.3
INVESTIGATION      7             0.3
OTHER              7             0.3
TYPES              7             0.3
WAS                7             0.3
BETWEEN            6             0.2
CONDUCTED          6             0.2
FACTORS            6             0.2
NORMAL             6             0.2
SUCH               6             0.2
WOULD              6             0.2
CHANGES            5             0.2
DEVELOPMENT        5             0.2
EFFECT             5             0.2
INCLUDE            5             0.2
INTO               5             0.2
IT                 5             0.2
MAN                5             0.2
SYSTEM             5             0.2
ASSOCIATED         4             0.1
BOTH               4             0.1
CERTAIN            4             0.1
FOUND              4             0.1
MEASURES           4             0.1
MORE               4             0.1
PROLONGED          4             0.1
PROVIDE            4             0.1
RELATIONSHIP       4             0.1
RELATIONSHIPS      4             0.1
THAN               4             0.1
THESE              4             0.1
USE                4             0.1
VARIABLES          4             0.1
WERE               4             0.1
YIELD              4             0.1

B. Authority List. It has already been mentioned that, in addition to its
prime function of selecting significant terms, the Authority File Program was
also designed to combine conceptually related terms, a function which corresponds
to the human process of editing the index vocabulary. For conceptually related
terms which have certain characters in sequential order in common, this
condensation and editing is achieved by matching on significant characters
only. The reduction in the number of extracted significant words after
their transformation into the new set of indexing terms (Uniterms) actually
appearing in the subject index produced by FAST is shown in Table 15.

Table 15. Reduction of the number of potential indexing terms for ILSE
sample documents in the process of transformation by
matching on significant characters.

               No. of Significant   No. of Indexing   % of
               Words Before         Terms After       Reduction
               Transformation       Transformation

Sample No. 1   1,522                1,114             26.8
Sample No. 2     503                  412             18.1
Sample No. 3     422                  319             24.4

C.  Residue  Record.  By  regularly  checking  the  Residue  Record  and  updating 
both  the  Authority  File  and  the  Kill  List  as  described  in  Section  6,  it  is 
possible  to  steadily  reduce  the  number  of  words  that  were  neither  killed 
nor  accepted  by  the  Authority  File  Program.  Specifically,  by  regular  updating 
it  is  possible  to  quickly  reduce  the  number  of  significant  words  in  the 
Residue  Record,  provided  there  are  no  essential  changes  in  the  subject  field 
coverage of the documents processed. A sudden increase of significant words
(i.e., potential indexing terms) in the Residue Record unmistakably indicates
that the input contains documents from a different field of knowledge than
the one for which the system was primarily designed and optimized.

Table 16 gives numerical data on these residue records for
Samples No. 2 and No. 3.

Table 16. Evaluation of the Residue Record (no-match output) for
ILSE sample No. 2 and No. 3 documents

[Column headings: Total No. of Word Occurrences on the Residue Record;
Total No. of Word Types on the Residue Record; No. of Word Types Accepted
as Indexing Terms per Document; No. of Word Types Rejected per Document]


11.6.  DEPTH  OF  FAST  INDEXING  AND  COMPARISON  WITH  HUMAN  INDEXING 


Machine  generated  indexing  terms  for  each  of  the  Sample  No.  2  and 
No.  3  documents  were  compared  with  the  indexing  terms  assigned  to  the  same 
documents  by  human  indexers  for  depth  of  indexing  and  commonality. 

For the first set of 30 test documents (sample No. 2), the FAST
program assigned approximately twice as many indexing terms as the human
indexers did. Of the indexing terms assigned by human indexers, 57.3 percent
were also picked by the machine on the first run. After editing
the residue and updating the Authority List, this figure increased to 65.2
percent. However, for that set of documents, these figures could not be
considered unbiased because human indexers had information available which
was not part of the input for the automatic indexing process.

For the second set of 30 test documents (sample No. 3), the machine
assigned approximately 46.4 percent more indexing terms per document than
human indexers did. The respective figures for terms in common with the terms
selected by human indexers were 59.8 percent before update and 63.8 percent
after update (see also Table 17).

The  analysis  of  the  terms,  which  were  assigned  by  the  human  indexers 
but  not  by  the  FAST  program,  disclosed  two  major  reasons  for  their  appearance: 

1. The indexers would assign more generic terms in addition to the
terms in the abstract (e.g. HYDRODYNAMICS in addition to
MAGNETOHYDRODYNAMICS when only the latter appeared in the text).

2. The indexers would assign synonymous terms (e.g. PLASMA when
MAGNETOHYDRODYNAMICS appeared in text).

In most cases this can also be done mechanically by more elaborate
posting instructions in the computer program, if such a capability is
required. Although the techniques for this were developed, they were not
incorporated into the FAST program.

The  relation  between  the  total  number  of  words  in  an  abstract  and 
the  number  of  indexing  terms  assigned  by  the  FAST  program  was  also  investigated. 
Annex IV, Items 1 and 2 give word counts by document with corresponding
numbers of indexing terms assigned by the FAST program and the ratios of the
number of indexing terms to the number of words in the documents. It was
established that there is a strong rank correlation between the number of
words in the document and the number of indexing terms assigned to that
document. For sample No. 2, the rank correlation coefficient is 0.9106 and
for sample No. 3, it is 0.457 (see Tables 18 and 19). However, the relation
is not linear. This is clearly demonstrated by calculating the ratio of the
indexing terms to the number of words in the document. Those ratios are
plotted in Figure 13. The rank correlation coefficients for these
sets of values also indicate that there is practically no correlation between
the document length in terms of number of words and the ratio of indexing
terms to the number of words in a document. The rank correlation coefficients
are 0.1 and -0.412 respectively (see Tables 20 and 21).

Table 18. Rank correlation of document length (number of words in the
document) and the number of indexing terms assigned by FAST
for sample No. 2 documents (compare also Annex IV).


Table 19. Rank correlation of document length (number of words in the
document) and the number of indexing terms assigned by
FAST for sample No. 3 documents (compare also Annex IV).

Document Length   Index Length   |d_i| = |X_i - Y_i|   d_i^2
Rank (X_i)        Rank (Y_i)

 1                 2              1                      1
 2                 6              4                     16
 3                 5              2                      4
 4                17             13                    169
 5                 4              1                      1
 6                 8              2                      4
 7                18             11                    121
 8                 3              5                     25
 9                16              7                     49
10                24             14                    196
11                 1             10                    100
12                20              8                     64
13                25             12                    144
14                10             14                    196
15                23              8                     64
16                15              1                      1
17                21              4                     16
18                28             10                    100
19                13              6                     36
20                11              9                     81
21                 9             12                    144
22                26              4                     16
23                14              9                     81
24                22              2                      4
25                19              6                     36
26                30              4                     16
27                12             15                    225
28                27              1                      1
29                29              0                      0
30                 7             23                    529

Sum of d_i^2 = 2,440

r = 0.457

Table 20. Rank correlation of document length (number of words in the
document) and the ratio of the number of indexing terms to
the number of words per document for sample No. 2 documents
(compare also Annex IV).

[Column headings: Document Length Rank (X_i); Index Length Rank (Z_i);
d_i = X_i - Z_i]

Table 21. Rank correlation of document length (number of words in the
documents) and the ratio of the number of indexing terms to
the number of words per document for sample No. 3 documents
(compare also Annex IV).
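
The coefficient printed in Table 19 is consistent with the usual Spearman rank correlation formula, r = 1 - 6 * (sum of d^2) / (n * (n^2 - 1)). The report does not name the formula, so treating it as Spearman's is an inference from the printed totals. A minimal sketch with hypothetical function names:

```python
def spearman_from_sum_d2(sum_d2, n):
    """Spearman rank correlation from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


def spearman_from_ranks(x_ranks, y_ranks):
    """Same coefficient, computed from two equally long lists of ranks (no ties)."""
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return spearman_from_sum_d2(sum_d2, len(x_ranks))


# Totals as printed in Table 19: sum of d^2 = 2,440 for n = 30 documents.
print(round(spearman_from_sum_d2(2440, 30), 3))  # 0.457

print(spearman_from_ranks([1, 2, 3], [1, 2, 3]))  # 1.0
print(spearman_from_ranks([1, 2, 3], [3, 2, 1]))  # -1.0
```

Note that the formula only stays within the range -1 to +1 when both columns are rankings of the same n items.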


11.7.  INDEXING  CONSISTENCY  TESTS 

Two types of consistency tests were made: inter-indexer and
intra-indexer consistency tests. In this context, machine and author are
considered "indexers".

The purpose of the inter-indexer consistency tests was to investigate
the variation in the choice of indexing terms between two or more indexers
(including author and machine) taken at a time. Six experienced indexers were
given the same four documents to index. There was no communication among the
indexers. They were not permitted to discuss the documents they indexed or
to compare the terms they assigned. The documents were also indexed independently
by the authors of the documents, and automatically by the FAST method. No effort
was made to evaluate how good or bad the individual indexing terms selected by
the indexers, author or machine were, since, in the investigator's opinion, there
are no absolute and generally acceptable criteria for such an evaluation (assuming
that the indexers possess the necessary amount of competence in their field).
Consequently, the comparison was made on purely formal grounds.

The  inter-indexer  consistency  coefficient  was  defined  as  the  ratio 
of  the  number  of  terms  which  are  common  to  a  group  of  n  individually  recognizable 
indexers  to  the  total  number  of  different  terms  selected  by  these  indexers. 

For a combination of n indexers*) at a time, the inter-indexer consistency
coefficient is:

$$
\varepsilon = \frac{T(A_1 A_2 \ldots A_n)}{\sum_{i} T(A_i) - \sum_{ij} T(A_i A_j) + \sum_{ijk} T(A_i A_j A_k) - \ldots + (-1)^{n-1}\, T(A_1 A_2 \ldots A_n)}
$$

where

$T(A_1 A_2 \ldots A_n)$ is the number of terms used by the indexers $A_1, A_2, \ldots, A_n$ in common;

$T(A_i)$ - number of terms assigned to the document by the indexer $A_i$, $i = 1, 2, \ldots, n$;

$T(A_i A_j)$ - number of terms used by two indexers $A_i$ and $A_j$ ($i, j = 1, 2, \ldots, n$; $i \neq j$) in common;

$T(A_i A_j A_k)$ - number of terms used by three indexers $A_i$, $A_j$ and $A_k$ in common, etc.;

and $\sum_{ij}$ means the sum over all i and j with $i \neq j$, $\sum_{ijk}$ means the sum over all i, j, k with no two of them equal (each combination counted once), and so on.

Obviously, if the indexers all assigned the same terms to a
given document, then

$$
T(A_1 A_2 \ldots A_n) = \sum_{i} T(A_i) - \sum_{ij} T(A_i A_j) + \ldots + (-1)^{n-1}\, T(A_1 A_2 \ldots A_n)
$$

and

$$
\varepsilon = 1 .
$$

On the other hand, if the indexers produced such sets of terms
that no elements (terms) were common to these sets, then

$$
T(A_1 A_2 \ldots A_n) = 0
$$

and consequently

$$
\varepsilon = 0 .
$$

*) Recall that the author and machine are also "indexers" for the
purpose of this study.


As  already  mentioned,  the  indexers  worked  independently  of  one 
another  and  no  consultation  was  permitted.  Aside  from  the  requirement  that 
the  documents  should  be  indexed  by  the  Uniterm  method  in  order  to  be  comparable 
with  the  machine  generated  indexes,  there  were  no  other  restrictions  imposed 
on  the  indexers:  the  indexers  were  not  bound  to  a  pre-established  vocabulary, 
neither  were  they  limited  in  the  amount  of  indexing  terms  per  document.  Since 
the  indexes  created  by  the  authors  of  the  documents  were  not  strictly  Uniterm 
indexes,  they  were  converted  to  Uniterm  by  breaking  up  pre-coordinated  terms. 
The four documents which were thus indexed, and the indexes evaluated for
inter-indexer consistency, are reproduced in Annex V, Items 1 - 4.

Tables in Items 1 through 4, Annex VI, list the terms selected by
various indexers, authors and machine, and the table in Annex VII gives
consistency  coefficients  for  various  combinations  of  indexers  for  the  four 
sample  documents. 
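
The inter-indexer consistency coefficient defined earlier in this section can be sketched with set operations, since the alternating sum in its denominator is the inclusion-exclusion count of all different terms used by the group. The function name and the sample terms are hypothetical; returning 0 for a group that assigned no terms at all is an assumption, as the report does not cover that case.

```python
def inter_indexer_consistency(term_sets):
    """Terms common to all indexers divided by all different terms they used."""
    sets = [set(s) for s in term_sets]
    union = set().union(*sets)
    if not union:
        return 0.0  # assumption: no terms at all is treated as zero consistency
    common = set.intersection(*sets)
    return len(common) / len(union)


indexer_a = {"PLASMA", "MAGNETISM"}
indexer_b = {"PLASMA", "LASER"}
machine = {"PLASMA", "MAGNETISM", "LASER"}

print(inter_indexer_consistency([indexer_a, indexer_a]))  # 1.0
print(inter_indexer_consistency([indexer_a, {"LASER"}]))  # 0.0
print(round(inter_indexer_consistency([indexer_a, indexer_b, machine]), 3))  # 0.333
```

Because the common terms can only shrink and the total can only grow as indexers are added, the coefficient cannot increase with group size, which matches the trend of the averages in Table 22.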

Table 22 gives the average values of consistency coefficients for
the same four sample documents for different sizes of indexer groups. These
figures reveal two very significant facts: (1) the substitution of the machine
(FAST) for an experienced indexer does not significantly affect the
inter-indexer consistency; (2) the inter-indexer consistency of a group, one
element of which is the machine, rapidly approaches the inter-indexer consistency
of a group of all-human indexers as the number of elements (indexers) in the
group whose products are compared increases. Furthermore, the figures in Table 22
show that the inter-indexer consistency is in all cases higher when one human
indexer is replaced by the machine (FAST) than when he is replaced by
the author. This can be considered a satisfactory proof of the adequacy of
FAST indexing in comparison with human indexing.

Comparison  was  also  made  of  variances  for  the  set  of  data  pertaining 
to  combinations  of  two  indexers  or  to  indexer  and  machine,  or  indexer  and  author. 
These  variances  are  given  in  Table  23.  The  average  values  of  variances  for 
all  four  sample  documents  are: 

Two indexers        100.52 x 10^-4

Indexer - Author    226.52 x 10^-4

Indexer - Machine    28.26 x 10^-4

The above figures indicate, within certain confidence limits, the
important fact that the deviations from the mean consistency values are
smaller when the sets of indexing terms produced by a human indexer are
compared with corresponding sets produced by the FAST program than they are
when sets of indexing terms produced by one human indexer are compared with
those produced by another. In turn, the deviations from the mean consistency
values are smaller for two human indexers than for indexer and author comparisons.
In other words, there are less drastic differences in selecting indexing terms
for given documents between an indexer and the FAST program than between any
two experienced indexers or between indexer and author.

In  many  practical  cases  the  intre- indexer  consistency  is,  however, 
even  more  important  than  the  inter-indexer  consistency.  For  the  purpose  of 
this  study,  the  intra-indexer  consistency  is  defined  as  the  amount  of 
consistency  and  reliability  in  selecting  indexing  terms  when  the  same  indexer 
re-indexes  the  same  document  after  certain  period  of  time.  The  time  period 
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ib 1 e  22,  Average  values  of  consistency  coefficients  for  indexer 
group  sizes  two  through  six  for  four  sample  documents. 


Any  2  indexers 

0.453 

One  indexer  &  machine 

0.392 

One  indexer  &  author 

0.350 

Any  3  indexers 

0.307 

Any  2  indexers  &  machine 

0.265 

Any  2  indexers  &  author 

0.213 

Any  4  indexers 

0.232 

Any  3  indexers  &  machine 

0.207 

Any  3  indexers  &  author 

0.163 

Any  5  indexers 

0. 187 

Any  4  indexers  &  machine 

0.170 

Any  4  indexers  6  author 

0.133 

Any  6  indexers 

0.158 

Any  5  indexers  &  machine 

0.144 

Any  5  indexers  &  author 

0.114 

chosen  was  two  months  in  order  to  reduce  to  a  great  extent  the  memory  effects. 
The  same  four  sample  documents  were  used  for  the  intra-indexer  consistency 
test.  Since,  however,  two  of  the  indexers,  who  indexed  the  documents  the 
first  time,  were  no  longer  available  for  the  re-indexing,  only  the  results 
of  four  indexers  were  compared  and  evaluated.  The  test  conditions  for  the  re¬ 
indexing  were  the  same  as  for  the  original  indexing.  Tables  In  Annex  VII 
Items  1  through  4,  show  the  indexing  terms  selected  by  the  four  indexers  and 
by  the  FAST  program  during  the  first  indexing  round  and  during  the  re- indexing 
round.  The  indexing  terms  selected  during  the  first  round  are  checked  "x"  and 
those  selected  when  the  documents  were  re- indexed  are  checked  by  "0".  Terms, 
which  were  picked  both  times,  are  checked  by 
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2V2 

Table 23. Variances (σ²) of inter-indexer consistency coefficients
for four sample documents.

                   Document         Document         Document         Document
                   TSR No. 1        TSR No. 2        TSR No. 3        TSR No. 5

Two indexers       51.65 x 10^-4    126.50 x 10^-4   183.03 x 10^-4   40.92 x 10^-4
Indexer-Author     678.01 x 10^-4   55.25 x 10^-4    112.45 x 10^-4   60.38 x 10^-4
Indexer-Machine    55.99 x 10^-4    39.29 x 10^-4    13.81 x 10^-4    32.24 x 10^-4

The intra-indexing coefficient is defined as the ratio of the
number of identical terms selected by the same indexer both the first and the
second time to the total number of different terms used by the indexer.
Thus the coefficient is

$$\frac{T_{or}}{T_o + T_r - T_{or}}$$

where

$T_{or}$ = number of same terms which have been used by the
indexer both when indexing a document the first time
and when re-indexing the same document after a lapse
of time.

$T_o$ = number of terms assigned by the indexer when the
document was indexed the first time.

$T_r$ = number of terms assigned by the indexer when the
document was re-indexed.


Obviously, if in re-indexing the document an indexer did not
assign any of the terms which he had assigned to the document when indexing it
for the first time, $T_{or}$ would be equal to zero and the intra-indexer
consistency coefficient would also be equal to zero. At the other extreme,
if an indexer used exactly the same terms when re-indexing the document
as he did the first time, then $T_{or} = T_o + T_r - T_{or}$ and the coefficient
would be equal to 1.
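
The definition above can be illustrated with a short sketch. It is not part of the FAST program; the function name and the example term lists are hypothetical and serve only to show how the ratio behaves at the two extremes and in between.

```python
def intra_indexer_consistency(first_round, second_round):
    """Ratio of terms chosen in both rounds to all different terms chosen.

    Implements T_or / (T_o + T_r - T_or) as defined in the text, treating
    each round as a set of distinct indexing terms.
    """
    first, second = set(first_round), set(second_round)
    t_or = len(first & second)   # terms selected both times
    t_o = len(first)             # terms selected the first time
    t_r = len(second)            # terms selected when re-indexing
    denominator = t_o + t_r - t_or
    if denominator == 0:
        raise ValueError("at least one term must be assigned in some round")
    return t_or / denominator


# Hypothetical example: two of four different terms are repeated.
example = intra_indexer_consistency(
    ["PLASMA", "WAVE", "FIELD"],
    ["PLASMA", "WAVE", "ION"],
)
```

With identical term sets the function returns 1, and with disjoint sets it returns 0, which matches the two limiting cases described above.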

Table 24 gives the intra-indexer coefficients for four human
indexers and for the machine (FAST) calculated for each of the four sample
documents indexed.

Table 24. Intra-indexer consistency coefficients for four sample
documents.

                          Document    Document    Document    Document
                          TSR No. 1   TSR No. 2   TSR No. 3   TSR No. 5

Indexer No. 1             0.750       0.643       0.765       0.642
Indexer No. 4             0.591       0.706       0.652       0.571
Indexer No. 5             0.500       0.666       0.600       0.590
Indexer No. 6             0.687       0.750       0.529       0.933
Machine (FAST Program)    1.000       1.000       1.000       1.000


The average intra-indexer consistency for all indexers and all
tests was 0.661. This means that there is a very high probability that an
indexer will assign different sets of indexing terms to one and the same
document at different points in time, viz. that his judgment as to which terms
are most representative of the contents of the document is not invariable,
but varies with time. Consequently, this results in a certain amount of
uncertainty on the part of the user as to the criteria which the indexers
apply in selecting the indexing terms. The FAST Program, of course, always
performs with 100% consistency.
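
As a check on the reported average, the sixteen human-indexer coefficients of Table 24 can be averaged directly. The sketch below only restates the table values; the variable names are illustrative.

```python
# Intra-indexer coefficients from Table 24 (documents TSR No. 1, 2, 3, 5).
table_24 = {
    "Indexer No. 1": [0.750, 0.643, 0.765, 0.642],
    "Indexer No. 4": [0.591, 0.706, 0.652, 0.571],
    "Indexer No. 5": [0.500, 0.666, 0.600, 0.590],
    "Indexer No. 6": [0.687, 0.750, 0.529, 0.933],
}

# Average over all human indexers and all four documents.
all_values = [value for row in table_24.values() for value in row]
average_consistency = sum(all_values) / len(all_values)
```

Rounded to three places, the result agrees with the figure of 0.661 given in the text.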


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11.8.  CHANNEL  CAPACITY  AND  EFFICIENCY 

The USE 1963 Subject Index and the indexes produced by the FAST
method for samples 2 and 3 of documents were also analyzed for the frequency
and the distribution of postings under the subject terms. The table in
Annex IX gives the frequency distribution of terms according to the number
of postings or entries associated with these terms for the ILSE 1963 Subject Index.

Figure 14 is a plot of the number of terms against the frequency
of postings for each term group on logarithmic-scale paper for the same
ILSE index. It can be noted that the general trend of the plot is somewhat
similar to the Zipf-Mandelbrot curve of a log-log plot of word frequency
versus word rank.*) However, because of the much greater spread of single points,
the difference is significant enough to prevent the conclusion that the frequency
of words as a function of word rank is equivalent to the frequency of postings
as a function of term rank.

Figure 14: A plot of the number of terms versus frequency of postings
on logarithmic scale for USE 1963 Subject Index.

Houston and Wall (150) published statistics on ten indexed collections
and plotted cumulative distributions of postings in these collections. They
found that all these distributions are nearly log-normal. The plot is
reproduced in Figure 15. For the purpose of comparison, the cumulative
distribution of postings for the ILSE 1963 system is also entered on the
plot. Obviously the latter distribution follows the same general pattern
as the ones reported by Houston and Wall, viz. it could be considered as
belonging to the class of log-normal distributions.

Figure 15: Cumulative distribution of postings per term for collections
reported by Houston and Wall and for the USE 1963 Index.

*) Zipf, G. K., 1949, Human Behavior and the Principle of Least Effort.
Addison-Wesley Co., Inc., Cambridge, Mass.


Obviously,  indexes  can  be  considered  as  channels  for  transmitting 
information  from  the  store  to  the  user.  As  such  they  can  be  formally 
evaluated  by  the  information  theory  methods.  This  approach  was  suggested  and 
investigated in greater detail by the author (281).


The overall efficiency of an index as an information channel can
be expressed by the efficiency coefficient

$$\eta = \eta_I \cdot \eta_R$$

where $\eta_I$ is the efficiency coefficient measuring the specificity of the
information retrieved or the information content of the indexing terms, and
$\eta_R$ is the efficiency coefficient measuring the retrievability and recall
in terms of the operational economy of the system.

The coefficient $\eta_I$ is obtained from the equation

$$\eta_I = \frac{-\sum_{i=1}^{n} p_i \ln p_i}{C_I}$$

where $p_i$ is the probability of occurrence of the term in the system,
or its relative frequency obtained as the ratio of the number of postings
under this term to the total number of postings in the collection, and $C_I$
is the channel capacity if the criterion is index specificity. In this case

$$C_I = \ln P$$

where $P$ is the total number of postings in the index.

In a similar way, the coefficient $\eta_R$ is obtained from

$$\eta_R = \frac{-\sum_{i=1}^{n} g_i \ln g_i}{C_R}$$

where $g_i$ is the probability of a term having $i$ postings
($i = 1, 2, 3, \ldots, n$), or its relative frequency obtained as the ratio
of the number of terms with $i$ postings to the total number of terms, and
$C_R$ is the channel capacity if the criterion is retrievability
based on operational economy. In this case

$$C_R = \bar{g} \ln \bar{g} - (\bar{g} - 1) \ln(\bar{g} - 1)$$

where $\bar{g}$ is the average number of postings per term.

Table 25. Distribution of postings in indexes for sample No. 2 and
No. 3 documents and collections.

[Columns, for each of Sample No. 2 and Sample No. 3: Number of Postings;
Number of Indexing Terms.]

From the data given in Table 25 we obtain:

Sample No. 2

$$\sum_i p_i \ln p_i = -5.8343$$

$$C_I = 6.4393$$

$$\eta_I = \frac{5.8343}{6.4393} = 0.906$$

Sample No. 3

$$\sum_i p_i \ln p_i = -5.399$$

$$C_I = \ln 651 = 6.4785$$

$$\eta_I = \frac{5.399}{6.4785} = 0.8334$$

Sample No. 2

$$-\sum_i g_i \ln g_i = 0.9288$$

$$C_R = \bar{g} \ln \bar{g} - (\bar{g} - 1) \ln(\bar{g} - 1) = 1.5194 \ln 1.5194 - 0.5194 \ln 0.5194 = 0.9755$$

$$\eta_R = \frac{0.9288}{0.9755} = 0.9521$$

Sample No. 3

$$-\sum_i g_i \ln g_i = 1.4444$$

$$C_R = 2.1845 \ln 2.1845 - 1.1845 \ln 1.1845 = 2.1845\,(7.6889 - 6.9077) - 1.1845\,(7.0766 - 6.9077) = 1.7065 - 0.2001 = 1.5065$$

$$\eta_R = \frac{1.4444}{1.5065} = 0.9588$$

Thus we finally obtain the overall efficiency coefficient for Sample No. 2

$$\eta = \eta_I \cdot \eta_R = 0.906 \times 0.9521 \approx 0.863$$

and for Sample No. 3

$$\eta = \eta_I \cdot \eta_R = 0.8334 \times 0.9588 = 0.7991$$
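
The coefficients defined in this section can be sketched in a few lines of code. The sketch assumes that an index is represented simply as a list giving the number of postings under each term; the function names and the toy index are hypothetical and are not taken from the FAST system.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Return -sum(p * ln p), skipping zero probabilities."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def specificity_efficiency(postings_per_term):
    """eta_I: entropy of the term probabilities p_i divided by C_I = ln P."""
    total_postings = sum(postings_per_term)
    if total_postings < 2:
        raise ValueError("need at least two postings so that ln P > 0")
    p = [count / total_postings for count in postings_per_term]
    return entropy(p) / math.log(total_postings)


def retrievability_capacity(mean_postings):
    """C_R = g ln g - (g - 1) ln(g - 1), with g the mean postings per term."""
    if mean_postings <= 1:
        raise ValueError("the mean number of postings per term must exceed 1")
    g = mean_postings
    return g * math.log(g) - (g - 1) * math.log(g - 1)


def retrievability_efficiency(postings_per_term):
    """eta_R: entropy of the posting-count frequencies g_i divided by C_R."""
    n_terms = len(postings_per_term)
    terms_by_postings = Counter(postings_per_term)
    g = [n / n_terms for n in terms_by_postings.values()]
    mean_postings = sum(postings_per_term) / n_terms
    return entropy(g) / retrievability_capacity(mean_postings)


def overall_efficiency(postings_per_term):
    """eta = eta_I * eta_R."""
    return (specificity_efficiency(postings_per_term)
            * retrievability_efficiency(postings_per_term))


# Hypothetical toy index: four terms with 1, 1, 2 and 4 postings.
toy_index = [1, 1, 2, 4]
```

For the toy index, P = 8 and the mean number of postings per term is 2, so the two coefficients can be worked out by hand (7/12 and 3/4) and compared with the functions.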


For the ILSE 1963 Subject Index, produced by human indexers, the corresponding
coefficients are

$$\eta_I = 0.6095$$

$$\eta_R = 0.7314$$

and thus the overall efficiency coefficient is

$$\eta = \eta_I \cdot \eta_R = 0.6095 \times 0.7314 = 0.4458$$
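
The arithmetic reported above can be re-checked with a short script. It uses only figures printed in the text (651 postings for Sample No. 3, the two mean postings per term, and the partial coefficients); the helper name is illustrative, and small differences in the last digit are to be expected from hand computation with logarithm tables.

```python
import math


def retrievability_capacity(mean_postings):
    """C_R = g ln g - (g - 1) ln(g - 1), with g the mean postings per term."""
    g = mean_postings
    return g * math.log(g) - (g - 1) * math.log(g - 1)


c_i_sample_3 = math.log(651)                   # reported as 6.4785
c_r_sample_2 = retrievability_capacity(1.5194)  # reported as 0.9755
c_r_sample_3 = retrievability_capacity(2.1845)  # reported as 1.5065

eta_sample_3 = 0.8334 * 0.9588   # FAST index, Sample No. 3
eta_human = 0.6095 * 0.7314      # index produced by human indexers
```

The recomputed capacities agree with the printed ones to within about 0.001, and the two products reproduce the printed overall coefficients when rounded to four places.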


Although the samples of the indexes produced by the FAST method which were
investigated here were too small to justify far-reaching conclusions, they
nevertheless indicate that such FAST indexes compare very favorably in
efficiency with indexes produced by human indexers, and that there is good
reason to believe that they need less optimizing than the ones produced by humans.
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ANNEX  1,  ITEM  1 


AGENCY- NASA ARC

ZZ

TASK - STUDY OF LONG-TERM EFFECTS OF LOW G-LOADING OF MAMMALS

PRIN INV-  DURATION- / /

TO STUDY THE EFFECTS OF LONG-TERM EXPOSURE TO AN ALTERED G
ENVIRONMENT /BY CENTRIFUGATION/ OF VARIOUS MAMMALS INCLUDING MICE,
RATS- PHYSIOLOGIC AND BIOCHEMICAL EFFECTS WILL BE MEASURED TO
DELINEATE THOSE RESPONSES WHICH ARE G-RESPONSIVE. CONTROL DATA AS
WELL AS TEST ANIMAL DATA WILL ULTIMATELY BE APPLIED TO SETTING UP
SPECIFIC EXPERIMENTS FOR SUSTAINED ZERO G STUDIES. ADAPTIVE CHANGES
IN THE HOMEOSTATIC PROCESSES WILL BE FOLLOWED IN SUPRA ONE G ADAPTED
ANIMALS WHEN THEY ARE RETURNED TO NORMAL G ENVIRONMENT.

INTRACELLULAR EFFECTS OF SUSTAINED G LOADING WILL BE STUDIED
PARTICULARLY CHANGES IN FAT AND CARBOHYDRATE METABOLISM OF
MITOCHONDRIA AND PROTEIN METABOLISM OF ISOLATED MICROSOMAL FRACTIONS
ALTERATIONS IN BLOOD AND TISSUE ISOENZYMES WILL BE STUDIED.

METABOLIC STUDIES BOTH AT THE WHOLE ANIMAL LEVEL AS WELL AS THE
TISSUE AND CELLULAR LEVELS WILL BE FOLLOWED WITH LABELED SUBSTRATES.
PROCESSES INVOLVED IN ADAPTING ANIMALS TO G-LOADS GREATER THAN ONE
G WILL BE STUDIED AS WELL AS THE REVERSE PROCESS IN SUPRA ONE G
ADAPTED ANIMALS.


ANNEX I, ITEM 2

AGENCY- NASA ARC

ZZ

TASK - NEUROHORMONAL STUDIES AS RELATED TO SPACE FLIGHT STRESSES

PRIN INV-  DURATION- / /

NEUROHORMONAL ASPECTS OF BRAIN MECHANISMS AND STRESS. /1/ TO IDENTIFY
THE NEUROHORMONE FROM THE HYPOTHALAMUS WHICH RELEASES ACTH FROM THE
PITUITARY. EVIDENCE SO FAR INDICATES THAT THIS IS VASOPRESSIN /ADH/.
/2/ TO ASSAY VASOPRESSIN IN BRAIN TISSUE, IN JUGULAR BLOOD AND IN
C-S FLUID IN ANIMALS UNDER VARIOUS PHYSIOLOGICAL AND UNPHYSIOLOGICAL
CONDITIONS SUCH AS PHYSICAL AND PSYCHOLOGICAL STRESSES. /3/ TO
INVESTIGATE THE MECHANISMS BY WHICH VASOPRESSIN IS RELEASED FROM THE
HYPOTHALAMUS UNDER STRESS AND ROLE OF VASOPRESSIN IN THE SYNTHESIS
AND DEGRADATION OF ACTH /WITH DR. STANLEY ELLIS/. /4/ TO MEASURE
ADRENAL STEROIDS AND CATECHOLAMINES IN BLOOD & URINE IN ANIMALS
AND MAN UNDER STRESS CONDITIONS.

SUBJECTING THE ORGANISM /INCLUDING MAN/ TO UNDUE STRESS SUCH AS THE
PHYSICAL STRESS OF ACCELERATION, DECELERATION, WEIGHTLESSNESS,
VIBRATION AND RADIATION AND TO PSYCHOLOGICAL AND
PHYSIOLOGICAL STRESSES SUCH AS CONFINEMENT IN A SATELLITE, ANXIETY,
DISTURBANCES IN SLEEP AND BIOLOGICAL RHYTHMS, FATIGUE, PAIN AND OTHER
BODILY DISCOMFORTS, MAY SEVERELY CHALLENGE THE HOMEOSTATIC MECHANISMS
OF THE BODY. IT IS IMPORTANT TO KNOW WHAT HAPPENS TO MAN IF THE
HIGHER OR LOWER LIMITS OF THESE REGULATORY MECHANISMS ARE PASSED OVER.
CALLING ATTENTION TO THESE FUNCTIONS SERVES TO
EMPHASIZE THE IMPORTANCE OF STUDYING THE "TRIGGER" MECHANISM IN THE
HYPOTHALAMUS WHICH GAVE OUT THE EARLIEST SIGNAL OF STRESS.
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terms  to  the  number  of  words  in  the  documents  of  Sample  No.  2.  Figures 
in  brackets  show  the  ranks. 
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Word  counts  by  document  with  corresponding  numbers  of 
indexing  terms  assigned  by  the  FAST  program  and  ratios  of  the  number 
of  indexing  terms  to  the  number  of  words  in  the  documents  of  Sample 
No.  3.  Figures  in  brackets  show  the  ranks. 
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0.192 
0.250 
0.304 
0.272
0.162
0.225 
0.256 
0.184 
0.278 
0.216 
0.194
0.242 
0.216 
0.235 
0.226 
0.237 
0.189 
0.281 
0.132 
0.228 
0.242 
0.176 
0.258 
0.310 
0.187 
0.239 
0.214 
0.262 
0.250 
0.179 
0.204 
0.182 
0.204 
0.195 
0.225 
0.237 
0.237 
0.220 
0.154 
0.263 
0.194 
0.147 
0.206 
0.194 
0.200 
0.135 
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2-4-A 

2-5-A 

2- 6-A 

3- 4-A 

3-5-A 

3- 6-A 

4- 5-A 

4- 6-A 

5- 6-A 

1 -2-3-4 
1-2-3-5 
1 - 2-3-6 
1 -2-4-5 
1 -2-4-6 
1 -2-5-6 
l -3-4-5 
1 -3-4-6 
1 -3-5-6 

1 - 4-5-6 

2- 3-4-S 
2-3-4-6 

1- 3-5-6 

2- 4-5-6 

3- 4-S-6 
1-2-3-M 
1-2-4-M 
1  —  2-5-M 
1-2-6-M 
1-3-4-M 
1-3-5-M 
1-3-6-M 
1-4-5-M 
1-4-6-M 

1- 5-6-M 

2- 3-4-M 


2-3-5-“ 


2-4-6-M 


0.235 

0.230 

0.235 

0.192 

0. 181 

0.260 

0.275 

0.250 

0.275 

0.208 

0.250 

0.285 

0.296 

0.238 

0.291 

0.333 

0.240 

0.294 

0.343 

0.250 

0.382 

0.193 

0.366 

0.285 

0.294 

0.259 

0.354 

0.333 

0.303 

0.280 

0.333 

0.230 

0.407 

0.318 

0.344 

0.250 

0.370 

0.350 

0.342 

0.200 

0.375 

0.259 

0.323 

0.206 

0.323 

0.269 

0.240 

0.378 

0.181 

0.297 

0.241 

0.333 

0.193 

0.306 

0.250 

0.323 

0.214 

0.361 

0.200 

0.323 

0.214 

0.285 

0.230 

0.333 

0.304 

0.285 

0.240 

0.368 

0.187 

I - 2-3 -A 
-2- 


0.365 

0.307 

0.300 

0.315 

0.275 

0.324 

0.342 

0.297 

0.305„ 

0.200 

0.171 

0.171 

0.176 


0.147 

0.193 

0.193 

0.250 

0.206 

0.166 

0.222 

0.172 

0.240 

0.193 

0.222 

0.178 

0.240 


0.178 

0.148 

0.206 

0.200 

0.208 

0.250 

0.171 

0.250 

0.176 

0.200 

0.151 

0.156 

0.241 

0.142 

0.129 

0.135 

0.14? 

0.108 

0.138 

0.088 


0.187 
0.170 
0. 145 
0.155 
0.217 
0.130 
0.186 
0.130 
0. 162 
0.116 
0.140 
0.117 
0.125 
0.140 
0.135 
0.147 
0.108 
0.111 
0.088 
0.077 
0.147 
0. 181 
0.151 
0.156 


0.132 
0.114 
0.182 
0.118 
0. 100 
0.143 
0.118 
0.125 
0.120 
0.167 
0.143 
0.175 
0.122 
0.179 
0.158 
0.125 
0.162 
0.194 
0.135 
0.095 
0.158 
0.189 


0.167 

0.149 

0.149 

0.130 

0.163 

0.159 

0.163 

0.178 

0.119 

0.171 

0.130 

0.152 

0.159 

0.109 

0.091 

0.182 

0.140 

0.146 

0.220 

0.143 

0.122 

0.125 

0.077 

0.162

1-3-4-A       0.193        0.240        0.161        0.105
1-3-5-A       0.193        0.192        0.142        0.079
1-3-6-A       0.200        0.260        0.148        0.114
1-4-5-A       0.172        0.217        0.161        0.102
1-4-6-A       0.214        0.315        0.166        0.111
1-5-6-A       0.185        0.263        0.153        0.091
2-3-4-A       0.222        0.200        0.138        0.098
2-3-5-A       0.222        0.156        0.108        0.077
2-3-6-A       0.200        0.206        0.111        0.108
2-4-5-A       0.194        0.172        0.166        0.050
2-4-6-A       0.200        0.230        0.171        0.105
2-5-6-A       0.171        0.192        [illegible]  0.086
3-4-5-A       0.225        0.178        [illegible]  0.054
3-4-6-A       0.233        0.250        [illegible]  0.086
3-5-6-A       0.200        0.200        0.100        0.094
4-5-6-A       0.214        0.227        0.151        0.059

1-2-3-4-5     0.277        0.193        0.135        0.091
1-2-3-4-6     0.333        0.250        0.147        0.119
1-2-3-5-6     0.285        0.193        0.108        0.143
1-2-4-5-6     0.285        0.259        0.111        0.098
1-3-4-5-6     0.333        0.230        0.088        0.125
2-3-4-5-6     0.305        0.200        0.078        0.098

1-2-3-4-M     0.307        0.181        0.140        0.122
1-2-3-5-M     0.317        0.138        0.120        0.122
1-2-3-6-M     0.282        0.181        0.125        0.125
1-2-4-5-M     0.250        0.187        0.120        0.104
1-2-4-6-M     0.289        0.241        0.106        0.085
1-2-5-6-M     0.250        0.193        0.104        0.125
1-3-4-5-M     0.270        0.161        0.102        0.109
1-3-4-6-M     0.314        0.214        0.130        0.114
1-3-5-6-M     0.270        0.161        0.087        0.159
1-4-5-6-M     0.277        0.230        0.087        0.111
2-3-4-5-M     0.285        0.142        0.094        0.083
2-3-4-6-M     0.300        0.187        0.080        0.087
2-3-5-6-M     0.261        0.147        [illegible]  0.152
2-4-5-6-M     0.268        0.193        0.100        0.087
3-4-5-6-M     0.289        0.166        0.061        0.136

1-2-3-4-A     0.162        0.193        0.138        0.093
1-2-3-5-A     0.162        0.147        0.108        0.068
1-2-3-6-A     0.166        0.193        0.111        0.098
1-2-4-5-A     0.135        0.166        0.138        0.048
1-2-4-6-A     0.166        0.222        0.142        0.100
1-2-5-6-A     0.138        0.178        0.114        0.077
1-3-4-5-A     0.151        0.172        0.117        0.049
1-3-4-6-A     0.187        0.240        0.121        0.077
1-3-5-6-A     0.156        0.135        0.100        0.081
1-4-5-6-A     0.166        0.217        0.121        0.053

2-3-4-5-A     0.184        0.151        0.102        0.046
2-3-4-6-A     0.189        0.200        0.105        0.073
2-3-5-6-A     0.162        0.156        0.077        0.077
2-4-5-6-A     0.162        0.172        0.131        0.050
3-4-5-6-A     0.182        0.178        0.083        0.054

1-2-3-4-5-6   0.270        0.193        0.077        0.091

1-2-3-4-5-M   0.238        0.138        0.094        0.080
1-2-3-4-6-M   0.275        0.181        0.080        0.082
1-2-3-5-6-M   0.238        0.138        0.078        0.122
1-2-4-5-6-M   0.243        0.187        0.080        0.083
1-3-4-5-6-M   0.263        0.161        0.061        0.109
2-3-4-5-6-M   0.255        0.142        0.057        0.083

1-2-3-4-5-A   0.128        0.147        0.102        0.044
1-2-3-4-6-A   0.157        0.193        0.105        0.070
1-2-3-5-6-A   0.131        0.147        0.077        0.070
1-2-4-5-6-A   0.166        0.166        0.105        0.048
1-3-4-5-6-A   0.147        0.172        0.083        0.049
2-3-4-5-6-A   0.153        0.151        0.073        0.046
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Indexing  terms  assigned  by  human  indexers  and  by  the  FAST 
program  to  sample  document  TSR  No.  1  in  intra-indexing  consistency 


test. 


Term 

Indexer 

#1 

1 ndexer 
#4 

1 ndexer 
#5 

Indexer 

#6 

Machine 

(FAST) 

ATTENUATION 

© 

0 

X 

© 

CHARGE 

© 

COLLISION 

© 

0 

X 

0 

© 

COMPRESSION 

© 

© 

© 

© 

© 

CUTOFF 

0 

X 

DENSITY 

O 

© 

X 

X 

© 

DISPERSION 

© 

EXCHANGE 

© 

FIELD 

© 

0 

© 

© 

© 

FREQUENCY 

© 

© 

© 

© 

© 

FUNCTION 

© 

0 

0 

© 

HYDROGEN 

© 

© 

© 

© 

© 

HYDROMAGNETICS 

© 

© 

© 

© 

HYDROMAGNETISM 

© 

IMPULSE 

X 

X 

© 

ION 

0 

© 

0 

© 

IONIZATION 

© 

© 

© 

© 

© 

LINEARITY 

© 
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ANNEX  VIII,  ITEM  1  (Cont.) 


1 

s' 

[ 

l 

f, 


Term 

1 ndexer 
#1 

Indexer 

#4 

Indexer 

#5 

1  ndexer 
#6 

Machine 

(FAST) 

MAGNETISM 

© 

© 

© 

© 

© 

MASS 

© 

© 

MEASUREMENT 

X 

© 

NEUTRAL 

0 

- 

0 

PLASMA 

© 

© 

© 

© 

© 

PLASMA-FILLED 

X 

PROPAGATION 

© 

0 

X 

© 

RANGE 

© 

RESISTIVITY 

© 

X 

© 

© 

© 

RESPONSE 

X 

X 

© 

SPEED 

© 

TEMPERATURE 

© 

THEORY 

© 

THERMAL 

© 

TRANSFER 

0 

0 

0 

© 

TRANSFORMER 

X 

©' 

WAVE 

© 

© 

© 

© 

© 

WAVEGUIDE 

© 

© 

© 

© 

© 

COLD 

0 

CONSISTENCY  75%  59.1%  50%  68.7%  100%
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ANNEX  VIII.  ITEM  2 


Indexing  terms  assigned  by  human  indexers  and  by  the  FAST 
program  to  sample  document  TSR  No.  2  in  intra-indexing  consistency 
test. 


Term 

1 ndexer 
!  #1 

1 ndexer 
#4 

■ 

Indexer 

#5 

1 ndexer 
#6 

Machine 

(FAST) 

ANALYSIS 

0 

0 

© 

BEHAVIOR 

© 

BOUNDARY 

© 

© 

© 

© 

© 

COLD 

0 

© 

COMPUTER 

© 

0 

© 

CONDITION 

X 

X 

0 

COUPLING 

© 

© 

DIELECTRICS 

© 

© 

© 

© 

© 

FREQUENCY 

0 

i 

© 

© 

GROUP 

1 

© 

HYDROMAGNETICS 

© 

© 

© 

© 

HYDROMAGNETI SM 

© 

LOW 

© 

MAGNETI SM 

© 

0 

© 

OBSERVATION 

© 

ORTHOGONALITY 

© 

© 

172 


Term 


Indexer  Indexer 
#1  #4 


Indexer 

#5 


Indexer 

#6 


Mach i ne 
(FAST) 


PLASMA  @ 

PLASMA-FILLED 
PROPAGATION  x 

RELATION 

SHEATH  @ 

SOLUTION 

SURFACE  @ 

TENSOR  O 

TRANSVERSE  O 

WAVE  @ 

WAVEGUIDE  @ 

NUMBER 
MAGNETIZED 


CONSISTENCY  64.3% 


Indexing  terms  assigned  by 


program  to  sample  document  TSR  No.  3  in 
test. 


Term 

1 ndexer 
#1 

l 1 H  fgl*  k 

AMPLITUDE 

© 

ATTENTUATION 

© 

0 

BETA 

X 

CHARACTERISTIC 

COMPRESSION 

X 

© 

CRITICAL 

CURVE 

CYCLOTRON 

© 

© 

CYLINDER 

DECAYING 

DENSITY 

© 

© 

DIAGNOSIS 

DISSIPATION 

ELECTRON 

X 

FLUID 

FREQUENCY 

© 

© 

HYDROGEN 

0 

© 

HYDROMAGNET 1C 

© 

© 

ANNEX  VIM.  ITEM 


an  indexers  and  by  the  FAST 
intra-indexing  consistency 


Indexer 

#5 

Indexer 

#6 

Machine 

0  ! 

© 

0  ! 

X 

© 

© 

0 

© 

© 

© 

© 

© 

© 

© 

© 

0 

© 

© 

© 

© 

© 

© 

© 

© 

0 

© 

©  i 

© 

ANNEX  VII.  ITEM  3  (Cont.) 


Indexer  Indexer  Indexer  Indexer  Machine 

m  #4  #5  ns 


ANNEX  VIII.  ITEM 


Term 

1 ndexer 
#1 

: 

Indexer 

#4 

Indexer 

#5 

Indexer 

#6 

Mach i ne 

TOOL 

1 

© 

TUBE 

X 

© 

WAVE 

© 

© 

© 

© 

© 

WAVEGUIDE 

© 

© 

© 

© 

© 

CORRELATION 

0 

MAGNETIZED 

0 

CONSISTENCY 

76.5% 

65.2% 

60.0% 

52.9%

100.0% 

176 


Indexing  terms  assigned  by  human  indexers  and  by  the  FAST 
program  to  sample  document  TSR  No.  5  in  intra- indexi ng  consistency 


test. 


Term 

Indexer 

#1 

Indexer 

#4 

Indexer 

#5 

1  ndexer 
#6 

Machine 

AMPLIFICATION 

X 

X 

© 

© 

AMPLIFIER 

O 

© 

© 

© 

© 

AMPLITUDE 

© 

© 

© 

APPROXIMATION 

© 

BROADENING 

X 

0 

© 

CHARACTERISTIC 

© 

COHERENCE 

© 

0 

© 

X 

© 

DENSITY 

0 

© 

DEPENDENCE 

© 

DOPPLER 

© 

© 

o 

© 

EFFECT 

© 

© 

i 

© 

ELECTROMAGNETIC 

X 

FIELD 

© 

0 

© 

FORMALISM 

© 

FREQUENCY 

© 

© 

© 

GAIN 

0 

© 

GAS 

© 

© 

GRAPH 

© 

77 


1 


ANNEX  V 1 1 1 ,  ITEM  4  (Cont.) 


Indexer  Indexer  Indexer  Indexer  Machine 
#1  #k  #5  #6 


ANNEX  VIII.  ITEM  4  (Cont 


Term 

I 

Indexer 

m 

; - 

1 ndexer 
#4 

Indexer 

#5 

Indexer 

#6 

Mach i ne 

SPECTRUM 

© 

© 

THEORY 

X 

© 

TIME 

© 

© 

TRAVEL 

© 

© 

TRAVELLING 

© 

© 

© 

VECTOR 

0 

© 

© 

WAVE 

© 

© 

© 

© 

© 

SPACE 

O 

CONSISTENCY  |  64.2%  |  57.1%  I  59.0%  I  93.3%  I  100.0% 


ANNEX IX

Frequency distribution of terms by number of postings for
USE 1963 index.

No. of Postings    No. of Terms    Relative Frequency of Terms
(uj)               F(uj)           with uj Postings f(uj)

1                  1342            0.4266
2                  462             0.1468
3                  268             0.0852
4                  156             0.0496
5                  111             0.0353
6                  75              0.0238
7                  72              0.0229
8                  61              0.0194
9                  37              0.0118
10                 37              0.0118
11                 29              0.0092
12                 35              0.0111
13                 24              0.0076
14                 23              0.0073
15                 20              0.0063
16                 20              0.0063
17                 17              0.0054
18                 11              0.0034
19                 21              0.0067
20                 22              0.0070
21                 13              0.0041
22                 8               0.0025
23                 14              0.0044
24                 8               0.0025
25                 7               0.0022
26                 11              0.0034
27                 9               0.0028
28                 9               0.0028
29                 6               0.0019
30                 8               0.0025
31                 8               0.0025
32                 4               0.0012
33                 4               0.0012
34                 9               0.0028
35                 4               0.0012
36                 3               0.0009
37                 2               0.0006
...
191                1               0.0003
194                1               0.0003
197                1               0.0003
218                1               0.0003
221                1               0.0003
248                1               0.0003
250                1               0.0003
267                1               0.0003
277                1               0.0003
288                1               0.0003
295                1               0.0003
316                1               0.0003
317                1               0.0003
322                1               0.0003
380                1               0.0003
388                1               0.0003
396                1               0.0003
426                1               0.0003
569                1               0.0003
710                1               0.0003
1,065              1               0.0003
1,072              1               0.0003
1,251              1               0.0003
1,310              1               0.0003
[illegible]        1               0.0003

Σ = 37,471         Σ = 3,146
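
The relative frequencies in the third column follow from dividing each term count by the total number of terms. The sketch below assumes that 3,146 is the total number of terms and 37,471 the total number of postings, as the two sums suggest; the function and variable names are illustrative only.

```python
TOTAL_TERMS = 3146       # sum of the "No. of Terms" column
TOTAL_POSTINGS = 37471   # sum of postings over all terms


def relative_frequency(terms_with_u_postings, total_terms=TOTAL_TERMS):
    """f(uj) = F(uj) / total number of terms."""
    return terms_with_u_postings / total_terms


# Relative frequency of terms with a single posting (first table row).
f_one_posting = relative_frequency(1342)

# Average number of postings per term for the whole index.
mean_postings_per_term = TOTAL_POSTINGS / TOTAL_TERMS
```

Recomputed values match the printed column to about four decimal places; a few printed entries differ in the last digit, which may reflect truncation rather than rounding in the original computation.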


COCKPIT  500070  500346  500372  500854  500H90  500893  500894  500896  500904  500941 
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HYDROMAGNETIC  WAVE  BOUNDARY  CONDITION  AND  A  SURFACE  WAVE  IN  A  PLASMA  FILLED 
WAVEGUIDE 

AUTHOR (S) 

SWANSON,  D.  G. 


ABSTRACT 


THE  ANALYSIS  OF  A  MAGNETIZED  PLASMA  IN  A  WAVEGUIDE  WITH  A  DIELECTRIC 
SHEATH  BETWEEN  THE  PLASMA  AND  WAVEGUIDE  IS  CONSIDERED.  WITHIN  THE  LIMITATIONS 
OF  THE  COLD  PLASMA,  EFFECTIVE  DIELECTRIC  TENSOR  APPROACH,  THE  PROBLEM  IS  SOLVED 
EXACTLY  AND  A  FEW  ILLUSTRATIVE  COMPUTER  SOLUTIONS  FOR  THE  BEHAVIOR  OF  THE 
TRANSVERSE  WAVE  NUMBER  ARE  PRESENTED.  ALSO,  SOME  APPROXIMATE  LOW  FREQUENCY 
EXPRESSIONS  ARE  DERIVED  FOR  THE  EFFECT  OF  THE  DIELECTRIC  SHEATH.  IT  IS  FOUND 
THAT  THESE  SOLUTIONS  AGREE  BETTER  WITH  EXPERIMENT  THAN  DO  THOSE  WHERE  NO  SHEATH 
AT  ALL  IS  ASSUMED,  AND  APPEAR  ADEQUATE  TO  ACCOUNT  FOR  ALL  EXPERIMENTAL 
OBSERVATIONS.  FOR  THE  CASE  OF  A  FINITE  OR  THICK  SHEATH,  THE  SOLUTIONS  DISAGREE 
WITH  SOME  OTHER  SHEATH  THEORIES,  HOWEVER,  IN  AN  AREA  WHERE  NO  EXPERIMENTAL 
OBSERVATIONS  ARE  YET  REPORTED. 

THE  DIELECTRIC  SHEATH  ALSO  ADDS  A  SURFACE  WAVE  TO  THE  GROUP  OF 
HYDROMAGNETIC  WAVES,  AND  THE  COUPLING  BETWEEN  THE  SURFACE  WAVE  AND  THE  HYDRO- 
MAGNETIC  WAVES  IS  SHOWN  IN  CERTAIN  FREQUENCY  REGIONS.  ORTHOGONALITY  RELATIONS 
ARE  GIVEN  WHICH  SHOW  THAT  THE  SURFACE  WAVE  AND  THE  HYDROMAGNETIC  WAVES  ARE  ALL 
MUTUALLY  ORTHOGONAL. 


KEY  WORDS:  ANALYSIS,  BEHAVIOR,  BOUNDARY,  COLD,  COMPUTER,  COUPLING,  DIELECTRICS, 

FREQUENCY,  GROUP,  HYDROMAGNETICS,  MAGNETISM,  OBSERVATIONS,  PLASMA, 
SHEATH,  TENSOR,  WAVE,  WAVEGUIDE. 
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